Sir: 

This Reply Brief is filed in response to the Examiner's Answer mailed January 29, 
2007. 

It is not believed that an extension of time and/or additional fee(s) are required, 
beyond those that may otherwise be provided for in documents accompanying this paper. In 
the event, however, that an extension of time is necessary to allow consideration of this 
paper, such an extension is hereby petitioned for under 37 C.F.R. § 1.136(a). Any additional 
fees believed to be due in connection with this paper may be charged to Deposit Account No. 
50-0220. 

I. The Examiner's Answer - Response to Arguments (starting at Page 10) 

In the interest of brevity, Appellant will refrain herein from readdressing all of the 
deficiencies with the pending rejections and, therefore, hereby incorporates the arguments set 
out in Appellant's Brief on Appeal as if set forth in their entirety. Appellant's following 
remarks address the new points raised in the Examiner's Answer. 

Regarding the Examiner's response to Appellant's remarks on the recitations of 
"configured to" in Appellant's claims, Appellant respectfully reiterates that recitations such 
as "configured to," which require that steps be performed, do limit the claims to a particular 
structure. In particular, the present claims clearly recite a process to be carried out by the 
processor circuit (such as the release of antibodies, waiting a predetermined time period, then 
activating the radiation source, then waiting a second predetermined time interval, then 
sensing the voltages generated by the radiation detector, etc.), which does impart a structure 
to the processor circuit. 

Appellant believes that the Examiner's citation to MPEP § 2106 indicates that the 
Examiner considers the recitation of "configured to" to be equivalent to language such as 
"whereby" or "adapted to." 1 However, Appellant respectfully points out that MPEP § 2106 
also refers the reader to MPEP § 2111.04, which reads in part: 

The determination of whether each of these clauses is a limitation in a 
claim depends on the specific facts of the case. In Hoffer v. Microsoft 
Corp., 405 F.3d 1326, 1329, 74 USPQ2d 1481, 1483 (Fed. Cir. 2005), the 
court held that when a "'whereby' clause states a condition that is material 
to patentability, it cannot be ignored in order to change the substance of 
the invention." Id. 

In holding that the whereby clause did impart patentable weight to the claims, the court in 
Hoffer held that the capability of the system "is more than the intended result of a process 
step; it is part of the process itself. This interactive element is described in the specification 
and prosecution history as an integral part of the invention." Id. Therefore, Appellant 
respectfully maintains that the term "configured to" does mean that the processor circuit 
does, in fact, perform the recited process steps, which are therefore not optional. 

With respect to the Examiner's remarks regarding language such as "algorithm", 
Appellant respectfully maintains that the absence of the word "algorithm," by itself, is not 
conclusive evidence that no step by step process is disclosed. See AT&T Corp. v. Excel 
Communications, Inc., 172 F.3d 1352, 1356 (Fed. Cir. 1999) ("This court recently pointed 
out that any step-by-step process, be it electrical chemical or mechanical, involves an 
"algorithm" in the broad sense of the term" (emphasis added)). Accordingly, Appellant 
respectfully maintains that the processor circuits, as recited in the present claims, are 
structurally different than the prior art despite the absence of the terminology suggested by 
the Examiner. 

1 Further, Appellant believes that the Examiner may actually be misunderstanding the term 
"configured to" to mean "configurable" based on the assertion that this recitation only 
"suggests or makes optional but does not require steps to be performed." Examiner's Answer, 
page 11. 
With regard to the Examiner's remarks on the term "programmed" not being recited in 
the claims, Appellant respectfully points out that previously issued patents indicate that it is 
not necessary to include the specific word "programmed" in a processor claim when disclosed 
in the function of a processor circuit. For example, see U.S. Patent Nos. 6,496,187, 
6,457,117, and 6,876,704 (copies of which are attached hereto). Accordingly, Appellant 
respectfully maintains that the absence of the word "programmed" from the claims is not 
convincing evidence that the processor circuit claims are not structurally different. 

II. Conclusion 

For the reasons set forth above and in Appellant's Brief on Appeal, Appellant requests 
reversal of the rejections, allowance of all claims and passing of the application to issue. 

Respectfully submitted, 



Myers Bigel Sibley & Sajovec, P. A. 
P.O. Box 37428 
Raleigh, North Carolina 27627 
Telephone: (919) 854-1400 
Facsimile: (919) 854-1401 
Customer Number 20792 



I hereby certify that this correspondence is being transmitted via the Office electronic filing system in 
accordance with § 1.6(a)(4) to the U.S. Patent and Trademark Office on February 22, 2007. 




CERTIFICATION OF TRANSMISSION 



Signature: 






US006457117B1 

(12) United States Patent (10) Patent No.: US 6,457,117 B1 

Witt (45) Date of Patent: Sep. 24, 2002 



(54) PROCESSOR CONFIGURED TO 

PREDECODE RELATIVE CONTROL 
TRANSFER INSTRUCTIONS AND REPLACE 
DISPLACEMENTS THEREIN WITH A 
TARGET ADDRESS 

(75) Inventor: David B. Witt, Austin, TX (US) 

(73) Assignee: Advanced Micro Devices, Inc., 

Sunnyvale, CA (US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 09/708,216 

(22) Filed: Nov. 7, 2000 

Related U.S. Application Data 

(63) Continuation of application No. 09/065,681, filed on Apr. 

23, 1998, now Pat. No. 6,167,506. 
(60) Provisional application No. 60/065,878, filed on Nov. 17, 

1997. 

(51) Int. Cl.7 G06F 9/312 

(52) U.S. Cl. 712/213; 712/204; 712/206; 

712/207; 712/211; 712/213; 712/227; 712/235; 

712/237; 712/239 

(58) Field of Search 712/213, 204, 

712/206, 207, 211, 227, 235, 237, 239 

(56) References Cited 

U.S. PATENT DOCUMENTS 

4,502,111 A 2/1985 Riffe et al. 
5,101,341 A 3/1992 Circello et al. 

(List continued on next page.) 

FOREIGN PATENT DOCUMENTS 

EP 0 238 810 A2 9/1987 

EP 0 423 726 A2 4/1991 

(List continued on next page.) 



OTHER PUBLICATIONS 

Wallace et al., "Multiple Branch and Block Prediction," 
©1997 IEEE, pp. 94-103. 

Tomasulo, "An Efficient Algorithm For Exploiting Multiple 
Arithmetic Units," 1967, IBM Journal, pp. 25-33. 

(List continued on next page.) 

Primary Examiner — Richard L. Ellis 

Assistant Examiner — Gautam R. Patel 

(74) Attorney, Agent, or Firm — Conley, Rose & Tayon, PC; 

Lawrence J. Merkel 



(57) 



ABSTRACT 



The processor is configured to predecode instruction bytes 
prior to their storage within an instruction cache. During the 
predecoding, relative branch instructions are detected. The 
displacement included within the relative branch instruction 
is added to the address corresponding to the relative branch 
instruction, thereby generating the target address. The pro- 
cessor replaces the displacement field of the relative branch 
instruction with an encoding of the target address, and stores 
the modified relative branch instruction in the instruction 
cache. The branch prediction mechanism may select the 
target address from the displacement field of the relative 
branch instruction instead of performing an addition to 
generate the target address. In one embodiment, relative 
branch instructions having eight bit and 32-bit displacement 
fields are included in the instruction set executed by the 
processor. Additionally, the processor employs predecode 
information (stored in the instruction cache with the corre- 
sponding instruction bytes) including a start bit and a control 
transfer bit corresponding to each instruction byte. The 
combination of the start bit indicating that the byte is the 
start of an instruction and the corresponding control transfer 
bit identifies the instruction as either a branch instruction or 
a non-branch instruction. For relative branch instructions 
including an eight bit displacement, the control transfer bit 
corresponding to the displacement field is used in conjunc- 
tion with the displacement field to store the encoded target 
address. Thirty-two bit displacement fields store the entirety 
of the target address, and hence the encoded target address 
comprises the target address. 

14 Claims, 15 Drawing Sheets 
PROCESSOR CONFIGURED TO 
PREDECODE RELATIVE CONTROL 
TRANSFER INSTRUCTIONS AND REPLACE 
DISPLACEMENTS THEREIN WITH A 
TARGET ADDRESS 

This Application is a continuation of U.S. patent application 
Ser. No. 09/065,681, now U.S. Pat. No. 6,167,506, 
filed Apr. 23, 1998, which claims benefit of priority to the 
Provisional Application Ser. No. 60/065,878, entitled "High 
Frequency, Wide Issue Microprocessor" filed on Nov. 17, 
1997 by Witt. The Provisional Application is incorporated 
herein by reference in its entirety. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 
This invention is related to the field of processors and, 
more particularly, to predecoding techniques within processors. 

2. Description of the Related Art 
Superscalar processors attempt to achieve high performance 
by dispatching and executing multiple instructions 
per clock cycle, and by operating at the shortest possible 
clock cycle time consistent with the design. To the extent 
that a given processor is successful at dispatching and/or 
executing multiple instructions per clock cycle, high performance 
may be realized. In order to increase the average 
number of instructions dispatched per clock cycle, processor 
designers have been designing superscalar processors which 
employ wider issue rates. A "wide issue" superscalar processor 
is capable of dispatching (or issuing) a larger maximum 
number of instructions per clock cycle than a "narrow 
issue" superscalar processor is capable of dispatching. During 
clock cycles in which a number of dispatchable instructions 
is greater than the narrow issue processor can handle, 
the wide issue processor may dispatch more instructions, 
thereby achieving a greater average number of instructions 
dispatched per clock cycle. 

Many processors are designed to execute the x86 instruction 
set due to its widespread acceptance in the computer 
industry. For example, the K5 and K6 processors from 
Advanced Micro Devices, Inc., of Sunnyvale, Calif., implement 
the x86 instruction set. The x86 instruction set is a 
variable length instruction set in which various instructions 
occupy differing numbers of bytes in memory. The type of 
instruction, as well as the addressing modes selected for a 
particular instruction encoding, may affect the number of 
bytes occupied by that particular instruction encoding. Variable 
length instruction sets, such as the x86 instruction set, 
minimize the amount of memory needed to store a particular 
program by only occupying the number of bytes needed for 
each instruction. In contrast, many RISC architectures 
employ fixed length instruction sets in which each instruction 
occupies a fixed, predetermined number of bytes. 

Unfortunately, variable length instruction sets complicate 
the design of wide issue processors. For a wide issue 
processor to be effective, the processor must be able to 
identify large numbers of instructions concurrently and 
rapidly within a code sequence in order to provide sufficient 
instructions to the instruction dispatch hardware. Because 
the location of each variable length instruction within a code 
sequence is dependent upon the preceding instructions, rapid 
identification of instructions is difficult. If a sufficient number 
of instructions cannot be identified, the wide issue 
structure may not result in significant performance gains. 
Therefore, a processor which provides rapid and concurrent 
identification of instructions for dispatch is needed. 
Another feature which is important to the performance 
achievable by wide issue superscalar processors is the 
accuracy and effectiveness of its branch prediction mecha- 
nism. As used herein, the branch prediction mechanism 
refers to the hardware which detects control transfer instruc- 
tions within the instructions being identified for dispatch and 
which predicts the next fetch address resulting from the 
execution of the identified control transfer instructions. 
Generally, a "control transfer" instruction is an instruction 
which, when executed, specifies the address from which the 
next instruction to be executed is fetched. Jump instructions 
are an example of control transfer instructions. A jump 
instruction specifies a target address different than the 
address of the byte immediately following the jump instruc- 
tion (the "sequential address"). Unconditional jump instruc- 
tions always cause the next instruction to be fetched to be the 
instruction at the target address, while conditional jump 
instructions cause the next instruction to be fetched to be either 
the instruction at the target address or the instruction at the 
sequential address responsive to an execution result of a 
previous instruction (for example, by specifying a condition 
flag set via instruction execution). Other types of instruc- 
tions besides jump instructions may also be control transfer 
instructions. For example, subroutine call and return instruc- 
tions may cause stack manipulations in addition to specify- 
ing the next fetch address. Many of these additional types of 
control transfer instructions include a jump operation (either 
conditional or unconditional) as well as additional instruc- 
tion operations. 

Control transfer instructions may specify the target 
address in a variety of ways. "Relative" control transfer 
instructions include a value (either directly or indirectly) 
which is to be added to an address corresponding to the 
relative control transfer instruction in order to generate the 
target address. The address to which the value is added 
depends upon the instruction set definition. For x86 control 
transfer instructions, the address of the byte immediately 
following the control transfer instruction is the address to 
which the value is added. Other instruction sets may specify 
adding the value to the address of the control transfer 
instruction itself. For relative control transfer instructions 
which directly specify the value to be added, an instruction 
field is included for storing the value and the value is 
referred to as a "displacement". 
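
The target address computation described above can be sketched as follows. This is a minimal illustration only, not the circuitry of the processor described herein: the function name relative_target, its parameters, and the 32-bit address width are assumptions made for the example. The x86 convention (base address is the byte immediately following the instruction) and the alternative convention (base address is the instruction itself) are both taken from the paragraph above.

```python
def relative_target(instr_addr: int, instr_len: int, displacement: int,
                    disp_bits: int = 8, addr_bits: int = 32,
                    base_is_next_instr: bool = True) -> int:
    """Return the target address of a relative control transfer instruction.

    Hypothetical helper: `displacement` is the raw (unsigned) contents of the
    displacement field, `disp_bits` is the width of that field.
    """
    # Sign-extend the raw displacement field to a Python integer.
    sign_bit = 1 << (disp_bits - 1)
    disp = (displacement & (sign_bit - 1)) - (displacement & sign_bit)

    # x86 adds the displacement to the address of the byte following the
    # instruction; other instruction sets may use the instruction address.
    base = instr_addr + instr_len if base_is_next_instr else instr_addr

    # Wrap the sum to the assumed address width.
    return (base + disp) & ((1 << addr_bits) - 1)


# A two-byte jump at 0x1000 with displacement 0xFE (-2) targets itself.
print(hex(relative_target(0x1000, 2, 0xFE)))
```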

On the other hand, "absolute" control transfer instructions 
specify the target address itself (again, either directly or 
indirectly). Absolute control transfer instructions therefore 
do not require an address corresponding to the control 
transfer instruction to determine the target address. Control 
transfer instructions which specify the target address indi- 
rectly (e.g. via one or more register or memory operands) are 
referred to as "indirect" control transfer instructions. 

Because of the variety of available control transfer 
instructions, the branch prediction mechanism may be quite 
complex. However, because control transfer instructions 
occur frequently in many program sequences, wide issue 
processors have a need for a highly effective (e.g. both 
accurate and rapid) branch prediction mechanism. If the 
branch prediction mechanism is not highly accurate, the 
wide issue processor may issue a large number of instruc- 
tions per clock cycle but may ultimately cancel many of the 
issued instructions due to branch mispredictions. On the 
other hand, the number of clock cycles used by the branch 
prediction mechanism to generate a target address needs to 
be minimized to allow for the instructions at the target 
address to be fetched. 

The term "branch instruction" is used herein to be syn- 
onymous with "control transfer instruction". 
SUMMARY OF THE INVENTION 

The problems outlined above are in large part solved by 
a processor in accordance with the present invention. The 
processor is configured to predecode instruction bytes prior 
to their storage within an instruction cache. During the 
predecoding, relative branch instructions are detected. The 
displacement included within the relative branch instruction 
is added to the address corresponding to the relative branch 
instruction, thereby generating the target address. The pro- 
cessor replaces the displacement field of the relative branch 
instruction with an encoding of the target address, and stores 
the modified relative branch instruction in the instruction 
cache. Advantageously, the branch prediction mechanism 
employed by the processor may more rapidly generate the 
target address corresponding to relative branch instructions. 
The branch prediction mechanism may simply select the 
target address from the displacement field of the relative 
branch instruction instead of performing an addition to 
generate the target address. The rapidly generated target 
address may be provided to the instruction cache for fetch- 
ing instructions more quickly than might otherwise be 
achieved. The amount of time elapsing between fetching a 
branch instruction and generating the corresponding target 
address may advantageously be reduced. Accordingly, the 
branch prediction mechanism may operate more efficiently, 
and hence processor performance may be increased through 
more rapid fetching of instructions stored at the target 
address. Superscalar processors may thereby support wider 
issue rates by fetching a larger number of instructions in a 
given period of time. 

In one embodiment, relative branch instructions having 
eight bit and 32-bit displacement fields are included in the 
instruction set executed by the processor. Additionally, the 
processor employs predecode information (stored in the 
instruction cache with the corresponding instruction bytes) 
including a start bit and a control transfer bit corresponding 
to each instruction byte. The combination of the start bit 
indicating that the byte is the start of an instruction and the 
corresponding control transfer bit identifies the instruction 
as either a branch instruction or a non-branch instruction. 
For relative branch instructions including an eight bit 
displacement, the control transfer bit corresponding to the 
displacement field is used in conjunction with the displace- 
ment field to store the encoded target address. The encoded 
target address includes a cache line offset portion and a 
relative cache line portion identifying the target cache line as 
a function of the cache line storing the relative branch 
instruction. Thirty-two bit displacement fields store the 
entirety of the target address, and hence the encoded target 
address comprises the target address. Other embodiments 
than the one described above are contemplated. 

Broadly speaking, the present invention contemplates a 
processor comprising a predecode unit and an instruction 
cache. The predecode unit is configured to predecode a 
plurality of instruction bytes received by the processor. 
Upon predecoding a relative control transfer instruction 
comprising a displacement, the predecode unit adds an 
address to the displacement to generate a target address 
corresponding to the relative control transfer instruction. 
Additionally, the predecode unit is configured to replace the 
displacement within the relative control transfer instruction 
with at least a portion of the target address. Coupled to the 
predecode unit, the instruction cache is configured to store 
the plurality of instruction bytes including the relative 
control transfer instruction with the displacement replaced 
by the portion of the target address. 
The present invention further contemplates a method for 
generating a target address for a relative control transfer 
instruction. A plurality of instruction bytes including the 
relative transfer instruction are predecoded to detect the 
presence of the relative control transfer instruction. An 
address is added to a displacement included in the relative 
control transfer instruction, thereby generating the target 
address. The displacement is replaced within the relative 
control transfer instruction with an encoding indicative of 
the target address. The plurality of instruction bytes includ- 
ing the relative control transfer instruction is stored in an 
instruction cache, with the displacement replaced by the 
encoding. 

Moreover, the present invention contemplates a predecode 
unit comprising a decoder and a target generator. The 
decoder is configured to decode a plurality of instruction 
bytes and to identify a relative control transfer instruction 
therein. The target generator is configured to add a displacement 
selected from the relative control transfer instruction to 
an address, thereby generating a target address corresponding 
to the relative control transfer instruction, and is further 
configured to generate an encoding of the target address with 
which the predecode unit replaces the displacement within 
the relative control transfer instruction. 

The present invention still further contemplates a computer 
system comprising a processor, a memory, and an 
input/output (I/O) device. The processor is configured to 
predecode a plurality of instruction bytes received by the 
processor. Upon predecoding a relative control transfer 
instruction comprising a displacement, the processor is 
configured to add an address to the displacement to generate 
a target address corresponding to the relative control transfer 
instruction. Additionally, the processor is configured to 
replace the displacement within the relative control transfer 
instruction with at least a portion of the target address. 
Coupled to the processor, the memory is configured to store 
the plurality of instruction bytes and to provide the instruction 
bytes to the processor. The I/O device is configured to 
transfer data between the computer system and another 
computer system coupled to the I/O device. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Other objects and advantages of the invention will 
become apparent upon reading the following detailed 
description and upon reference to the accompanying draw- 
ings in which: 

FIG. 1 is a block diagram of one embodiment of a 
superscalar processor. 

FIG. 2 is a block diagram of one embodiment of a 
fetch/scan unit shown in FIG. 1. 

FIG. 3 is a block diagram of one embodiment of a decode 
and lookahead/collapse unit shown in FIG. 1. 

FIG. 4 is a block diagram of one embodiment of a 
predecode unit shown in FIG. 1. 

FIG. 4A is a block diagram of one embodiment of a target 
generator shown in FIG. 4. 

FIG. 5 is a diagram illustrating a control transfer instruction 
having an 8-bit offset and the corresponding predecode 
information according to one embodiment of the processor 
shown in FIG. 1. 

FIG. 6 is a diagram illustrating a control transfer instruction 
having a 32-bit offset and the corresponding predecode 
information according to one embodiment of the processor 
shown in FIG. 1. 

FIG. 7 is a diagram illustrating several non-control transfer 
instructions and the corresponding predecode information 
according to one embodiment of the processor shown in 
FIG. 1. 
FIG. 8 is a block diagram of one embodiment of a branch 
scanner shown in FIG. 2. 

FIG. 9 is a block diagram of one embodiment of a prefetch 
control unit shown in FIG. 2. 

FIG. 10 is a truth table for one embodiment of the decoder 
shown in FIG. 9. 

FIG. 10A is a flowchart illustrating operation of one 
embodiment of the decoder shown in FIG. 9. 

FIG. 11 is a flowchart illustrating operation of one 
embodiment of the L1 prefetch control unit shown in FIG. 9. 

FIG. 12 is a table illustrating instruction fetch and dispatch 
results for one embodiment of the processor shown in 
FIG. 1 in which up to two branch instructions are predicted 
per clock cycle. 

FIG. 13 is a block diagram of one embodiment of an 
instruction queue illustrated in FIG. 1. 

FIG. 14 is a block diagram of one embodiment of a future 
file, register file, and reorder buffer shown in FIG. 1. 

FIG. 15 is a block diagram of one embodiment of a 
computer system including the processor shown in FIG. 1. 

While the invention is susceptible to various modifications 
and alternative forms, specific embodiments thereof 
are shown by way of example in the drawings and will 
herein be described in detail. It should be understood, 
however, that the drawings and detailed description thereto 
are not intended to limit the invention to the particular form 
disclosed, but on the contrary, the intention is to cover all 
modifications, equivalents and alternatives falling within the 
spirit and scope of the present invention as defined by the 
appended claims. 

DETAILED DESCRIPTION OF THE INVENTION 

Turning now to FIG. 1, a block diagram of one embodiment 
of a superscalar processor 10 is shown. Other embodiments 
are possible and contemplated. In the embodiment 
shown in FIG. 1, processor 10 includes a predecode unit 12, 
an L1 I-cache 14, an L0 I-cache 16, a fetch/scan unit 18, an 
instruction queue 20, an alignment unit 22, a lookahead/collapse 
unit 24, a future file 26, a reorder buffer/register file 
28, a first instruction window 30A, a second instruction 
window 30B, a plurality of functional units 32A, 32B, 32C, 
and 32D, a plurality of address generation units 34A, 34B, 
34C, and 34D, a load/store unit 36, an L1 D-cache 38, an 
FPU/multimedia unit 40, and an external interface unit 42. 
Elements referred to herein by a particular reference number 
followed by various letters will be collectively referred to 
using the reference number alone. For example, functional 
units 32A, 32B, 32C, and 32D will be collectively referred 
to as functional units 32. 

In the embodiment of FIG. 1, external interface unit 42 is 
coupled to predecode unit 12, L1 D-cache 38, an L2 interface 
44, and a bus interface 46. Predecode unit 12 is further 
coupled to L1 I-cache 14. L1 I-cache 14 is coupled to L0 
I-cache 16 and to fetch/scan unit 18. Fetch/scan unit 18 is 
also coupled to L0 I-cache 16 and to instruction queue 20. 
Instruction queue 20 is coupled to alignment unit 22, which 
is further coupled to lookahead/collapse unit 24. Lookahead/collapse 
unit 24 is further coupled to future file 26, reorder 
buffer/register file 28, load/store unit 36, first instruction 
window 30A, second instruction window 30B, and FPU/multimedia 
unit 40. FPU/multimedia unit 40 is coupled to 
load/store unit 36 and to reorder buffer/register file 28. 
Load/store unit 36 is coupled to L1 D-cache 38. First 
instruction window 30A is coupled to functional units 
32A-32B and to address generation units 34A-34B. 
Similarly, second instruction window 30B is coupled to 
functional units 32C-32D and address generation units 
34C-34D. Each of L1 D-cache 38, functional units 32, and 
address generation units 34 is coupled to a plurality of 
result buses 48 which are further coupled to load/store unit 
36, first instruction window 30A, second instruction window 
30B, reorder buffer/register file 28, and future file 26. 

Predecode unit 12 receives instruction bytes fetched by 
external interface unit 42 and predecodes the instruction 
bytes prior to their storage within L1 I-cache 14. Predecode 
information generated by predecode unit 12 is stored in L1 
I-cache 14 as well. Generally, predecode information is 
provided to aid in the identification of instruction features 
which may be useful during the fetch and issue of instruc- 
tions but which may be difficult to generate rapidly during 
the fetch and issue operation. The term "predecode", as used 
herein, refers to decoding instructions to generate predecode 
information which is later stored along with the instruction 
bytes being decoded in an instruction cache (e.g. L1 I-cache 
14 and/or L0 I-cache 16). 

In one embodiment, processor 10 employs two bits of 
predecode information per instruction byte. One of the bits, 
referred to as the "start bit", indicates whether or not the 
instruction byte is the initial byte of an instruction. When a 
group of instruction bytes is fetched, the corresponding set 
of start bits identifies the boundaries between instructions 
within the group of instruction bytes. Accordingly, multiple 
instructions may be concurrently selected from the group of 
instruction bytes by scanning the corresponding start bits. 
While start bits are used to locate instruction boundaries by 
identifying the initial byte of each instruction, end bits could 
alternatively be used to locate instruction boundaries by 
identifying the final byte of each instruction. 
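
As a rough software illustration of how start bits expose instruction boundaries, the sketch below splits a fetched group of instruction bytes using one start bit per byte. The function names are hypothetical, and the sketch assumes that the group begins on an instruction boundary and that the last instruction ends within the group. The hardware described herein examines the start bits concurrently; the sequential loop here only models the result.

```python
def instruction_starts(start_bits):
    """Return the byte positions whose start bit is set."""
    return [i for i, bit in enumerate(start_bits) if bit]


def split_instructions(instr_bytes, start_bits):
    """Split a group of instruction bytes at the positions marked by start bits.

    Assumes len(instr_bytes) == len(start_bits) and start_bits[0] == 1.
    """
    starts = instruction_starts(start_bits)
    # Each instruction runs from its start byte up to the next start byte.
    ends = starts[1:] + [len(instr_bytes)]
    return [bytes(instr_bytes[s:e]) for s, e in zip(starts, ends)]


# Three variable length instructions of 1, 2, and 5 bytes.
group = [0x90, 0xEB, 0x05, 0xB8, 0x01, 0x00, 0x00, 0x00]
bits = [1, 1, 0, 1, 0, 0, 0, 0]
for instr in split_instructions(group, bits):
    print(instr.hex())
```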

The second predecode bit used in this embodiment, 
referred to as the "control transfer" bit, identifies which 
instructions are branch instructions. The control transfer bit 
corresponding to the initial byte of an instruction indicates 
whether or not the instruction is a branch instruction. The 
control transfer bit corresponding to subsequent bytes of the 
instruction is a don't care except for relative branch instruc- 
tions having a small displacement field. According to one 
particular embodiment, the small displacement field is an 8 
bit field. Generally, a "small displacement field" refers to a 
displacement field having fewer bits than the target address 
generated by branch instructions. For relative branch 
instructions having small displacement fields, the control 
transfer bit corresponding to the displacement byte is used as 
described below. 

In addition to generating predecode information corre- 
sponding to the instruction bytes, predecode unit 12 is 
configured to recode the displacement field of relative 
branch instructions to actually store the target address in the 
present embodiment. In other words, predecode unit 12 adds 
the displacement of the relative branch instruction to the 
address corresponding to the relative branch instruction as 
defined by the instruction set employed by processor 10. The 
resulting target address is encoded into the displacement 
field as a replacement for the displacement, and the updated 
displacement field is stored into L1 I-cache 14 instead of the 
original displacement field. Target address generation is 
simplified by precomputing relative target addresses, and 
hence the branch prediction mechanism may operate more 
efficiently. 
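
The recoding step can be sketched for the 32-bit case, in which the displacement field is wide enough to hold the whole target address. This is a simplified model under stated assumptions, not the predecode unit itself: the function name recode_rel32 is hypothetical, the displacement is assumed to occupy the last four bytes of the instruction, and the target is assumed to be written back in the same little-endian byte order as the original x86 displacement. The example instruction uses the x86 near jump opcode E9 followed by a 32-bit displacement.

```python
import struct


def recode_rel32(instr: bytes, instr_addr: int) -> bytes:
    """Replace a trailing 32-bit displacement with the precomputed target.

    Hypothetical helper: `instr` is the complete instruction and
    `instr_addr` is the address of its first byte.
    """
    prefix, disp_field = instr[:-4], instr[-4:]
    (disp,) = struct.unpack("<i", disp_field)  # signed 32-bit displacement

    # x86: the displacement is relative to the byte following the instruction.
    target = (instr_addr + len(instr) + disp) & 0xFFFFFFFF

    # Store the target address where the displacement used to be.
    return prefix + struct.pack("<I", target)


# JMP rel32 at 0x2000 with displacement 0x100 targets 0x2105.
original = bytes([0xE9]) + struct.pack("<i", 0x100)
recoded = recode_rel32(original, 0x2000)
print(recoded.hex())
```

Once the instruction is stored in this form, a later fetch can read the target directly out of the field rather than performing the addition again, which is the simplification the paragraph above describes.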

In one embodiment of processor 10 which employs the 
x86 instruction set, predecode unit 12 is configured to recode 
eight bit and 32 bit displacement fields. The 32 bit displacement 
fields may store the entirety of the target address. On 
the other hand, the eight bit displacement field is encoded. 
More particularly, the eight bit displacement field and cor- 
responding control transfer predecode bit is divided into a 
cache line offset portion and a relative cache line portion. 
The cache line offset portion is the cache line offset portion 
of the target address. The relative cache line portion defines 
the cache line identified by the target address (the "target 
cache line") in terms of a number of cache lines above or 
below the cache line storing the relative branch instruction. 
A first cache line is above a second cache line if each byte 
within the first cache line is stored at an address which is 
numerically greater than the addresses at which the bytes 
within the second cache line are stored. Conversely, a first 
cache line is below the second cache line if each byte within 
the first cache line is stored at an address which is numerically
less than the addresses at which the bytes within the second
cache line are stored. A signed eight bit displacement
specifies an address which is within +/-128 bytes of the address
corresponding to the branch instruction. Accordingly, the
number of above and below cache lines which can be 
reached by a relative branch instruction having an eight bit 
displacement is limited. The relative cache line portion 
encodes this limited set of above and below cache lines. 

Tables 1 and 2 below illustrate an exemplary encoding of
the predecode information corresponding to a byte in accor- 
dance with one embodiment of processor 10. 

TABLE 1

Predecode Encoding

Start   Control
Bit     Transfer Bit   Meaning

1       0              Start byte of an instruction which is not a branch.
1       1              Start byte of a branch instruction.
0       x              Not an instruction boundary. Control Transfer Bit
                       corresponding to displacement is used on 8-bit
                       relative branches to encode target address as shown
                       in Table 2 below.



TABLE 2

Target Address Encoding

Control        Displacement Byte
Transfer Bit   Most Significant Bits (binary)   Meaning

0              00                               Within Current Cache Line
0              01                               One Cache Line Above
0              10                               Two Cache Lines Above
1              01                               One Cache Line Below
1              10                               Two Cache Lines Below


Note: Remaining displacement byte bits are the offset within the target 
cache line. Control Transfer Bit is effectively a direction, and the most 
significant bits of the displacement byte are the number of cache lines. 
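
The encoding of Table 2 can be illustrated with a short sketch. The sketch makes several assumptions that are not stated in the description: 64 byte cache lines (taken from one embodiment of the instruction cache), a displacement byte that is the last byte of the branch instruction so that the target is the address following the displacement byte plus the displacement, a relative cache line count measured from the cache line holding the displacement byte, and addresses large enough that the target is non-negative. The function names are hypothetical.

```python
LINE_BYTES = 64   # assumed cache line size (64 byte line embodiment)
OFFSET_BITS = 6   # log2(LINE_BYTES)


def encode_disp8(disp_addr: int, disp8: int) -> tuple[int, int]:
    """Return (control_transfer_bit, recoded_displacement_byte) per Table 2."""
    if not -128 <= disp8 <= 127:
        raise ValueError("disp8 must be a signed 8-bit value")
    target = disp_addr + 1 + disp8            # assumed x86 short-branch rule
    delta = target // LINE_BYTES - disp_addr // LINE_BYTES
    direction = 1 if delta < 0 else 0         # 1 = below, 0 = above or same line
    recoded = (abs(delta) << OFFSET_BITS) | (target % LINE_BYTES)
    return direction, recoded


def decode_disp8(disp_addr: int, ct_bit: int, recoded: int) -> int:
    """Rebuild the target address from the recoded fields."""
    count = recoded >> OFFSET_BITS            # number of cache lines away
    offset = recoded & (LINE_BYTES - 1)       # offset within the target line
    line = disp_addr // LINE_BYTES + (-count if ct_bit else count)
    return line * LINE_BYTES + offset
```

Under these assumptions the target is never more than two cache lines away in either direction, so two bits of line count plus six bits of offset fill the displacement byte and the control transfer bit carries the direction.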



Predecode unit 12 conveys the received instruction bytes
and corresponding predecode information to L1 I-cache 14
for storage. L1 I-cache 14 is a high speed cache memory for
storing instruction bytes and predecode information. L1
I-cache 14 may employ any suitable configuration, including
direct mapped and set associative configurations. In one
particular embodiment, L1 I-cache 14 is a 128 KB, two way
set associative cache employing 64 byte cache lines. L1
I-cache 14 includes additional storage for the predecode
information corresponding to the instruction bytes stored
therein. The additional storage is organized similar to the
instruction bytes storage. As used herein, the term "cache
line" refers to the unit of allocation of storage in a particular
cache. Generally, the bytes within a cache line are manipulated
(i.e. allocated and deallocated) by the cache as a unit.

In one embodiment, L1 I-cache 14 is linearly addressed
and physically tagged. A cache is linearly addressed if at
least one of the address bits used to index the cache is a
linear address bit which is subsequently translated to a
physical address bit. The tags of a linearly addressed/
physically tagged cache include each translated bit in addition
to the bits not used to index. As specified by the x86
architecture, instructions are defined to generate logical
addresses which are translated through a segmentation translation
mechanism to a linear address and further translated
through a page translation mechanism to a physical address.
It is becoming increasingly common to employ flat addressing
mode, in which the logical address and corresponding
linear address are equal. Processor 10 may be configured to
assume flat addressing mode. Accordingly, fetch addresses,
target addresses, etc. as generated by executing instructions
are linear addresses. In order to determine if a hit is detected
in L1 I-cache 14, the linear address presented thereto by
fetch/scan unit 18 is translated using a translation lookaside
buffer (TLB) to a corresponding physical address which is
compared to the physical tags from the indexed cache lines
to determine a hit/miss. When flat addressing mode is not
used, processor 10 may still execute code but additional
clock cycles may be used to generate linear addresses from
logical addresses.
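
As an illustration of why some index bits are linear bits that must also appear in the tag, the following sketch works out the address fields for the 128 KB, two way, 64 byte line embodiment. The 4 KB page size is an assumption (the common x86 page size, not stated above), and the function name is hypothetical.

```python
def index_fields(size_bytes: int, ways: int, line_bytes: int,
                 page_bytes: int = 4096):
    """Return (sets, index_bit_positions, translated_index_bits).

    Assumes power-of-two sizes. Bits at or above log2(page_bytes) are
    changed by page translation; lower bits are the same in the linear
    and physical address.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    page_bits = page_bytes.bit_length() - 1
    index_range = list(range(offset_bits, offset_bits + index_bits))
    translated = [b for b in index_range if b >= page_bits]
    return sets, index_range, translated


sets, index_range, translated = index_fields(128 * 1024, 2, 64)
```

Under these assumptions the cache has 1024 sets indexed by address bits 6 through 15, of which bits 12 through 15 are subject to translation and would therefore be included in the physical tag.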

L0 I-cache 16 is also a high speed cache memory for
storing instruction bytes. Because L1 I-cache 14 is large, the
access time of L1 I-cache 14 may be large. In one particular
embodiment, L1 I-cache 14 uses a two clock cycle access
time. In order to allow for single cycle fetch access, L0
I-cache 16 is employed. L0 I-cache 16 is comparatively smaller
than L1 I-cache 14, and hence may support a more rapid
access time. In one particular embodiment, L0 I-cache 16 is
a 512 byte fully associative cache. Similar to L1 I-cache 14,
L0 I-cache 16 is configured to store cache lines of instruction
bytes and corresponding predecode information (e.g. 512
bytes stores eight 64 byte cache lines and corresponding
predecode data is stored in additional storage). In one
embodiment, L0 I-cache 16 may be linearly addressed and
linearly tagged.

Fetch/scan unit 18 is configured to generate fetch
addresses for L0 I-cache 16 and prefetch addresses for L1
I-cache 14. Instructions fetched from L0 I-cache 16 are
scanned by fetch/scan unit 18 to identify instructions for
dispatch as well as to locate branch instructions and to form
branch predictions corresponding to the located branch
instructions. Instruction scan information and corresponding
instruction bytes are stored into instruction queue 20 by
fetch/scan unit 18. Additionally, the identified branch
instructions and branch predictions are used to generate
subsequent fetch addresses for L0 I-cache 16.

Fetch/scan unit 18 employs a prefetch algorithm to
attempt to prefetch cache lines from L1 I-cache 14 to L0
I-cache 16 prior to the prefetched cache lines being fetched
by fetch/scan unit 18 for dispatch into processor 10. Any
suitable prefetch algorithm may be used. In one
embodiment, fetch/scan unit 18 is configured to prefetch the
next sequential cache line to a cache line fetched from L0
I-cache 16 during a particular clock cycle unless: (i) a branch
misprediction is signalled; (ii) an L0 I-cache miss is
detected; or (iii) a target address is generated which is
assumed to miss L0 I-cache 16. In one particular
embodiment, relative branch instructions employing 32-bit
displacements and branch instructions employing indirect
target address generation are assumed to miss L0 I-cache 16.
For case (i), fetch/scan unit 18 prefetches the cache line
sequential to the corrected fetch address. For cases (ii) and
(iii), fetch/scan unit 18 prefetches the corresponding miss or
target address.
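
The prefetch policy of this embodiment can be summarized in a brief sketch. The function and parameter names are hypothetical, the 64 byte line size is taken from another embodiment, and the priority order among cases (i) through (iii) when more than one applies is an assumption, since the description does not specify it.

```python
LINE_BYTES = 64  # assumed cache line size


def next_line(addr: int) -> int:
    """Address of the cache line sequential to the one containing addr."""
    return (addr // LINE_BYTES + 1) * LINE_BYTES


def prefetch_address(fetch_addr, *, corrected_addr=None, miss_addr=None,
                     assumed_miss_target=None):
    """Choose the L1 prefetch address for one clock cycle."""
    if corrected_addr is not None:        # case (i): branch misprediction
        return next_line(corrected_addr)
    if miss_addr is not None:             # case (ii): L0 I-cache miss
        return miss_addr
    if assumed_miss_target is not None:   # case (iii): target assumed to miss
        return assumed_miss_target
    return next_line(fetch_addr)          # default: next sequential line
```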

Fetch/scan unit 18 employs an aggressive branch prediction
mechanism in an attempt to fetch larger "runs" of instructions
during a clock cycle. As used herein, a "run" of
instructions is a set of one or more instructions predicted to
be executed in the sequence specified within the set. For
example, fetch/scan unit 18 may fetch runs of 24 instruction
bytes from L0 I-cache 16. Each run is divided into several
sections which fetch/scan unit 18 scans in parallel to identify
branch instructions and to generate instruction scan information
for instruction queue 20. According to one
embodiment, fetch/scan unit 18 attempts to predict up to two
branch instructions per clock cycle in order to support large
instruction runs.

Instruction queue 20 is configured to store instruction 
bytes provided by fetch/scan unit 18 for subsequent dis- 
patch. Instruction queue 20 may operate as a first-in, first-out 
(FIFO) buffer. In one embodiment, instruction queue 20 is 
configured to store multiple entries, each entry comprising: 
a run of instructions, scan data identifying up to five 
instructions within each section of the run, and addresses 
corresponding to each section of the run. Additionally, 
instruction queue 20 may be configured to select up to six 
instructions within up to four consecutive run sections for 
presentation to alignment unit 22. Instruction queue 20 may, 
for example, employ 2-3 entries. An exemplary embodiment 
of instruction queue 20 is illustrated below in FIG. 13. 

Alignment unit 22 is configured to route instructions 
identified by instruction queue 20 to a set of issue positions 
within lookahead/collapse unit 24. In other words, alignment 
unit 22 selects the bytes which form each instruction from 
the run sections provided by instruction queue 20 responsive 
to the scan information provided by instruction queue 20. 
The instructions are provided into the issue positions in 
program order (i.e. the instruction which is first in program 
order is provided to the first issue position, the second 
instruction in program order is provided to the second issue 
position, etc.). 

Lookahead/collapse unit 24 decodes the instructions pro- 
vided by alignment unit 22. FPU/multimedia instructions 
detected by lookahead/collapse unit 24 are routed to FPU/ 
multimedia unit 40. Other instructions are routed to first 
instruction window 30A, second instruction window 30B, 
and/or load/store unit 36. In one embodiment, a particular 
instruction is routed to one of first instruction window 30A 
or second instruction window 30B based upon the issue 
position to which the instruction was aligned by alignment 
unit 22. According to one particular embodiment, instructions
from alternate issue positions are routed to alternate
instruction windows 30A and 30B. For example, instructions
from issue positions zero, two, and four may be routed
to the first instruction window 30A and instructions from
issue positions one, three, and five may be routed to the
second instruction window 30B. Instructions which include
a memory operation are also routed to load/store unit 36 for
access to L1 D-cache 38.
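
The alternating assignment in the example above amounts to selecting a window by the parity of the issue position. A minimal sketch, with a hypothetical function name and the window reference numerals used as labels:

```python
def route_window(issue_position: int) -> str:
    """Even issue positions go to window 30A, odd positions to window 30B."""
    return "30A" if issue_position % 2 == 0 else "30B"


# Routing for the six issue positions of the example.
routing = [route_window(p) for p in range(6)]
```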

Additionally, lookahead/collapse unit 24 attempts to generate
lookahead addresses or execution results for certain
types of instructions. Lookahead address/result generation
may be particularly beneficial for embodiments employing
the x86 instruction set. Because of the nature of the x86
instruction set, many of the instructions in a typical code
sequence are versions of simple moves. One reason for this
feature is that x86 instructions include two operands, both of
which are source operands and one of which is a destination
operand. Therefore, one of the source operands of each
instruction is overwritten with an execution result.
Furthermore, the x86 instruction set specifies very few
registers for storing register operands. Accordingly, many
instructions are moves of operands to and from a stack
maintained within memory. Still further, many of the instruction
dependencies are dependencies upon the ESP/EBP
registers and yet many of the updates to these registers are
increments and decrements of the previously stored values.
To accelerate the execution of these instructions,
lookahead/collapse unit 24 generates lookahead copies of
the ESP and EBP registers for each of the instructions decoded
during a clock cycle. Additionally, lookahead/collapse unit
24 accesses future file 26 for register operands selected by
each instruction. For each register operand, future file 26
may be storing either an execution result or a tag identifying
a reorder buffer result queue entry corresponding to the most
recent instruction having that register as a destination operand.

In one embodiment, lookahead/collapse unit 24 attempts
to perform an address calculation for each instruction which:
(i) includes a memory operand; and (ii) register operands
used to form the address of the memory operand are available
from future file 26 or lookahead copies of ESP/EBP.
Additionally, lookahead/collapse unit 24 attempts to perform
a result calculation for each instruction which: (i) does
not include a memory operand; (ii) specifies an add/subtract
operation (including increment and decrement); and (iii)
register operands are available from future file 26 or lookahead
copies of ESP/EBP. In this manner, many simple
operations may be completed prior to instructions being sent
to instruction windows 30A-30B.
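
The two sets of conditions above can be expressed as a small decision function. This is a sketch under simplifying assumptions: the record type, its field names, and the return labels are hypothetical, and operand availability is reduced to a single flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decoded:
    has_memory_operand: bool
    is_add_sub: bool        # add/subtract, including increment/decrement
    regs_available: bool    # needed registers available from the future
                            # file or the lookahead ESP/EBP copies


def lookahead_action(instr: Decoded) -> Optional[str]:
    """Return 'address', 'result', or None if no lookahead is attempted."""
    if instr.has_memory_operand:
        return "address" if instr.regs_available else None
    if instr.is_add_sub and instr.regs_available:
        return "result"
    return None
```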

Lookahead/collapse unit 24 detects dependencies
between a group of instructions being dispatched and collapses
any execution results generated therein into instructions
dependent upon those instruction results. Additionally,
lookahead/collapse unit 24 updates future file 26 with the
lookahead execution results. Instruction operations which
are completed by lookahead/collapse unit 24 (i.e. address
generations and/or instruction results are generated and
load/store unit 36 or future file 26 and the result queue are
updated) are not dispatched to instruction windows
30A-30B.

Lookahead/collapse unit 24 allocates a result queue entry
in reorder buffer/register file 28 for each instruction dispatched.
In one particular embodiment, reorder buffer/
register file 28 includes a result queue organized in a
line-oriented fashion in which storage locations for execution
results are allocated and deallocated in lines having
enough storage for execution results corresponding to a
maximum number of concurrently dispatchable instructions.
If less than the maximum number of instructions are
dispatched, then certain storage locations within the line are
empty. Subsequently dispatched instructions use the next
available line, leaving the certain storage locations empty. In
one embodiment, the result queue includes 40 lines, each of
which may store up to six execution results corresponding to
concurrently dispatched instructions. Execution results are
retired from the result queue in order into the register file
included within reorder buffer/register file 28. Additionally,
the reorder buffer handles branch mispredictions, transmitting
the corrected fetch address generated by the execution
of the branch instruction to fetch/scan unit 18. Similarly,
instructions which generate other exceptions are handled
within the reorder buffer. Results corresponding to instructions
subsequent to the exception-generating instruction are
discarded by the reorder buffer. The register file comprises
a storage location for each architected register. For example,
the x86 instruction set defines 8 architected registers. The
register file for such an embodiment includes eight storage
locations. The register file may further include storage
locations used as temporary registers by a microcode unit in
embodiments employing microcode units. Further details of
one exemplary embodiment of future file 26 and reorder
buffer/register file 28 are illustrated in FIG. 14 below.

Future file 26 maintains the speculative state of each
architected register as instructions are dispatched by
lookahead/collapse unit 24. As an instruction having a
register destination operand is decoded by lookahead/
collapse unit 24, the tag identifying the storage location
within the result queue portion of reorder buffer/register file
28 assigned to the instruction is stored into the future file 26
storage location corresponding to that register. When the
corresponding execution result is provided, the execution
result is stored into the corresponding storage location
(assuming that a subsequent instruction which updates the
register has not been dispatched).

It is noted that, in one embodiment, a group of up to six
instructions is selected from instruction queue 20 and moves
through the pipeline within lookahead/collapse unit 24 as a
unit. If one or more instructions within the group generates
a stall condition, the entire group stalls. An exception to this
rule is if lookahead/collapse unit 24 generates a split line
condition due to the number of ESP updates within the
group. Such a group of instructions is referred to as a "line"
of instructions herein.

Instruction windows 30 receive instructions from
lookahead/collapse unit 24. Instruction windows 30 store the
instructions until the operands corresponding to the instructions
are received, and then select the instructions for
execution. Once the address operands of an instruction
including a memory operation have been received, the
instruction is transmitted to one of the address generation
units 34. Address generation units 34 generate an address
from the address operands and forward the address to
load/store unit 36. On the other hand, once the execution
operands of an instruction have been received, the instruction
is transmitted to one of the functional units 32 for
execution. In one embodiment, each integer window
30A-30B includes 25 storage locations for instructions.
Each integer window 30A-30B is configured to select up to
two address generations and two functional unit operations
for execution each clock cycle in the address generation
units 34 and functional units 32 connected thereto. In one
embodiment, instructions fetched from L0 I-cache 16
remain in the order fetched until stored into one of instruction
windows 30, at which point the instructions may be
executed out of order.

In embodiments of processor 10 employing the x86
instruction set, an instruction may include implicit memory
operations for load/store unit 36 as well as explicit functional
operations for functional units 32. Instructions having
no memory operand do not include any memory operations,
and are handled by functional units 32. Instructions having
a source memory operand and a register destination operand
include an implicit load memory operation handled by
load/store unit 36 and an explicit functional operation
handled by functional units 32. Instructions having a
memory source/destination operand include implicit load
and store memory operations handled by load/store unit 36
and an explicit functional operation handled by functional
units 32. Finally, instructions which do not have an explicit
functional operation are handled by load/store unit 36. Each
memory operation results in an address generation handled
either by lookahead/collapse unit 24 or address generation
units 34. Memory operations and instructions (i.e. functional
operations) may be referred to herein separately, but may be
sourced from a single instruction.

Address generation units 34 are configured to perform
address generation operations, thereby generating addresses
for memory operations in load/store unit 36. The generated
addresses are forwarded to load/store unit 36 via result buses
48. Functional units 32 are configured to perform integer
arithmetic/logical operations and execute branch instructions.
Execution results are forwarded to future file 26,
reorder buffer/register file 28, and instruction windows
30A-30B via result buses 48. Address generation units 34
and functional units 32 convey the result queue tag assigned
to the instruction being executed upon result buses 48 to
identify the instruction being executed. In this manner,
future file 26, reorder buffer/register file 28, instruction
windows 30A-30B, and load/store unit 36 may identify
execution results with the corresponding instruction. FPU/
multimedia unit 40 is configured to execute floating point
and multimedia instructions.

Load/store unit 36 is configured to interface with L1
D-cache 38 to perform memory operations. A memory
operation is a transfer of data between processor 10 and an
external memory. The memory operation may be an explicit
instruction, or may be an implicit portion of an instruction
which also includes operations to be executed by functional
units 32. Load memory operations specify a transfer of data
from external memory to processor 10, and store memory
operations specify a transfer of data from processor 10 to
external memory. If a hit is detected for a memory operation
within L1 D-cache 38, the memory operation is completed
therein without access to external memory. Load/store unit
36 may receive addresses for memory operations from
lookahead/collapse unit 24 (via lookahead address
calculation) or from address generation units 34. In one
embodiment, load/store unit 36 is configured to perform up to
three memory operations per clock cycle to L1 D-cache 38.
For this embodiment, load/store unit 36 may be configured
to buffer up to 30 load/store memory operations which have
not yet accessed D-cache 38. The embodiment may further
be configured to include a 96 entry miss buffer for buffering
load memory operations which miss D-cache 38 and a 32
entry store data buffer. Load/store unit 36 is configured to
perform memory dependency checking between load and
store memory operations.

L1 D-cache 38 is a high speed cache memory for storing
data. Any suitable configuration may be used for L1 D-cache
38, including set associative and direct mapped configurations.
In one particular embodiment, L1 D-cache 38 is a 128
KB two way set associative cache employing 64 byte lines.
L1 D-cache 38 may be organized as, for example, 32 banks
of cache memory per way. Additionally, L1 D-cache 38 may
be a linearly addressed/physically tagged cache employing a
TLB similar to L1 I-cache 14.

External interface unit 42 is configured to transfer cache
lines of instruction bytes and data bytes into processor 10 in
response to cache misses. Instruction cache lines are routed
to predecode unit 12, and data cache lines are routed to L1
D-cache 38. Additionally, external interface unit 42 is configured
to transfer cache lines discarded by L1 D-cache 38
to memory if the discarded cache lines have been modified
by processor 10. As shown in FIG. 1, external interface unit
42 is configured to interface to an external L2 cache via L2
interface 44 as well as to interface to a computer system via 
bus interface 46. In one embodiment, bus interface unit 46 
comprises an EV/6 bus interface. 
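
Returning to future file 26, its behavior of holding either a result queue tag or an execution result for each register can be sketched as below. This is an illustrative model only; the class, method names, and tuple representation are hypothetical and are not taken from FIG. 14.

```python
class FutureFile:
    """Per-register entry is ('value', v) or ('tag', result_queue_tag)."""

    def __init__(self, registers):
        self.entries = {reg: ("value", 0) for reg in registers}

    def dispatch(self, dest_reg, tag):
        # The most recently dispatched writer of dest_reg owns the entry.
        self.entries[dest_reg] = ("tag", tag)

    def writeback(self, tag, value):
        # Store the result only where the entry still names this tag, i.e.
        # no later instruction updating the register has been dispatched.
        for reg, entry in self.entries.items():
            if entry == ("tag", tag):
                self.entries[reg] = ("value", value)


ff = FutureFile(["EAX", "EBX"])
ff.dispatch("EAX", 3)
ff.dispatch("EAX", 7)   # a later writer replaces tag 3
ff.writeback(3, 99)     # ignored: EAX now waits on tag 7
ff.writeback(7, 5)      # accepted
```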

Turning now to FIG. 2, a block diagram of one embodi- 
ment of fetch/scan unit 18 is shown. Other embodiments are 
possible and contemplated. As shown in FIG. 2, fetch/scan 
unit 18 includes a prefetch control unit 50, a plurality of 
select next blocks 52A-52C, an instruction select multi- 
plexor (mux) 54, an instruction scanner 56, a branch scanner 
58, a branch history table 60, a branch select mux 62, a 
return stack 64, an indirect address cache 66, and a forward 
collapse unit 68. Prefetch control unit 50 is coupled to L1
I-cache 14, L0 I-cache 16, indirect address cache 66, return
stack 64, branch history table 60, branch scanner 58, and
instruction select mux 54. Select next block 52A is coupled
to L1 I-cache 14, while select next blocks 52B-52C are
coupled to L0 I-cache 16. Each select next block 52 is 
coupled to instruction select mux 54, which is further 
coupled to branch scanner 58 and instruction scanner 56. 
Instruction scanner 56 is coupled to instruction queue 20. 
Branch scanner 58 is coupled to branch history table 60, 
return stack 64, and branch select mux 62. Branch select 
mux 62 is coupled to indirect address cache 66. Branch 
history table 60 and branch scanner 58 are coupled to 
forward collapse unit 68, which is coupled to instruction 
queue 20. 

Prefetch control unit 50 receives branch prediction infor- 
mation (including target addresses and taken/not taken 
predictions) from branch scanner 58, branch history table 
60, return stack 64, and indirect address cache 66. Respon- 
sive to the branch prediction information, prefetch control 
unit 50 generates fetch addresses for L0 I-cache 16 and a 
prefetch address for L1 I-cache 14. In one embodiment,
prefetch control unit 50 generates two fetch addresses for L0 
I-cache 16. The first fetch address is selected as the target 
address corresponding to the first branch instruction identi- 
fied by branch scanner 58 (if any). The second fetch address 
is the sequential address to the fetch address selected in the 
previous clock cycle (i.e. the fetch address corresponding to 
the run selected by instruction select mux 54). 

L0 I-cache 16 provides the cache lines (and predecode
information) corresponding to the two fetch addresses, as 
well as the cache lines (and predecode information) which 
are sequential to each of those cache lines, to select next 
blocks 52B-52C. More particularly, select next block 52B 
receives the sequential cache line corresponding to the 
sequential address and the next incremental cache line to the 
sequential cache line. Select next block 52C receives the 
target cache line corresponding to the target address as well 
as the cache line sequential to the target cache line. 
Additionally, select next blocks 52B-52C receive the offset 
portion of the corresponding fetch address. Select next 
blocks 52B-52C each select a run of instruction bytes (and 
corresponding predecode information) from the received 
cache lines, beginning with the run section including the 
offset portion of the corresponding fetch address. Since the 
offset portion of each fetch address can begin anywhere 
within the cache line, the selected run may include portions
of the fetched cache line and the sequential cache line to the 
fetched cache line. Hence, both the fetched cache line and 
the sequential cache line are received by select next blocks 
52B-52C. 
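
The run selection performed by a select next block can be sketched as follows. The 24 byte run length is taken from the earlier example; the 8 byte run section size and the function name are assumptions made for illustration.

```python
RUN_BYTES = 24      # run length from the example embodiment
SECTION_BYTES = 8   # assumed run section size (not stated above)


def select_run(fetched_line: bytes, sequential_line: bytes,
               offset: int) -> bytes:
    """Select a run beginning with the run section that contains offset."""
    start = (offset // SECTION_BYTES) * SECTION_BYTES
    window = fetched_line + sequential_line   # both lines are available
    return window[start:start + RUN_BYTES]


line_a = bytes(range(64))
line_b = bytes(range(64, 128))
run = select_run(line_a, line_b, 50)   # starts at byte 48, spills into line_b
```

With an offset of 50 the run begins at byte 48 of the fetched line and takes its last eight bytes from the sequential line, which is why both lines are supplied.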

Similarly, select next block 52A receives a prefetched
cache line (and corresponding predecode information) from
L1 I-cache 14 and selects an instruction run therefrom. Since
one cache line is prefetched from L1 I-cache 14, the run
selected therefrom may comprise less than a full run if the
offset portion of the prefetch address is near the end of the
cache line. It is noted that the fetch cache lines from L0
I-cache 16 may be provided in the same clock cycle as the
corresponding addresses are generated by prefetch control
unit 50, but the prefetch cache line may be a clock cycle
delayed due to the larger size and slower access time of L1
I-cache 14. In addition to providing the prefetched cache line
to select next block 52A, L1 I-cache 14 provides the
prefetched cache line to L0 I-cache 16. If the prefetched
cache line is already stored within L0 I-cache 16, L0 I-cache
16 may discard the prefetched cache line. However, if the
prefetched cache line is not already stored in L0 I-cache 16,
the prefetched cache line is stored into L0 I-cache 16. In this
manner, cache lines which may be accessed presently are
brought into L0 I-cache 16 for rapid access therefrom.
According to one exemplary embodiment, L0 I-cache 16
comprises a fully associative cache structure of eight entries.
A fully associative structure may be employed due to the
relatively small number of cache lines included in L0
I-cache 16. Other embodiments may employ other organizations
(e.g. set associative or direct-mapped).

Prefetch control unit 50 selects the instruction run provided
by one of select next blocks 52 in response to branch
prediction information by controlling instruction select mux
54. As will be explained in more detail below, prefetch
control unit 50 receives target addresses from branch scanner
58, return stack 64, and indirect address cache 66 early
in the clock cycle as well as at least a portion of the opcode
byte of the first branch instruction identified by branch
scanner 58. Prefetch control unit 50 decodes the portion of
the opcode byte to select the target address to be fetched
from L0 I-cache 16 from the various target address sources
and provides the selected target address to L0 I-cache 16. In
parallel, the sequential address to the fetch address selected
in the previous clock cycle (either the target address or the
sequential address from the previous clock cycle, depending
upon the branch prediction from the previous clock cycle) is
calculated and provided to L0 I-cache 16. Branch prediction
information (i.e. taken or not taken) is provided by branch
history table 60 late in the clock cycle. If the branch
instruction corresponding to the target address fetched from
L0 I-cache 16 is predicted taken, then prefetch control unit
50 selects the instruction run provided by select next block
52C. On the other hand, if the branch instruction is predicted
not taken, then the instruction run selected by select next
block 52B is selected. The instruction run provided by select
next block 52A is selected if a predicted fetch address
missed L0 I-cache 16 in a previous clock cycle and was
fetched from L1 I-cache 14. Additionally, the instruction run
from L1 I-cache 14 is selected if the instruction run was
prefetched responsive to a branch instruction having a 32 bit
displacement or indirect target address generation or an L0
I-cache miss was fetched.
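
The selection made through instruction select mux 54 can be summarized in a few lines. The function and parameter names are hypothetical, and giving the run from select next block 52A priority over the other two whenever a run from the larger cache is pending is an assumption about how the conditions combine.

```python
def select_run_source(l1_run_pending: bool, predicted_taken: bool) -> str:
    """Return the select next block whose run is passed on.

    l1_run_pending: a run fetched from the larger cache is due, for example
    after a miss in the smaller cache or for a 32 bit displacement or
    indirect branch target.
    """
    if l1_run_pending:
        return "52A"
    return "52C" if predicted_taken else "52B"
```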

The selected instruction run is provided to instruction
scanner 56 and branch scanner 58. Instruction scanner 56
scans the predecode information corresponding to the
selected instruction run to identify instructions within the
instruction run. More particularly in one embodiment,
instruction scanner 56 scans the start bits corresponding to
each run section in parallel and identifies up to five instructions
within each run section. Pointers to the identified
instructions (offsets within the run section) are generated.
The pointers, instruction bytes, and addresses (one per run
section) are conveyed by instruction scanner 56 to instruction
queue 20. If a particular run section includes more than
five instructions, the information corresponding to run sections
subsequent to the particular run section is invalidated
and the particular run section and subsequent run sections
are rescanned during the next clock cycle.

Branch scanner 58 scans the instruction run in parallel
with instruction scanner 56. Branch scanner 58 scans the
start bits and control transfer bits of the instruction run to
identify the first two branch instructions within the instruction
run. As described above, a branch instruction is identified
by the control transfer bit corresponding to the start
byte of an instruction (as identified by the start bit) being set.
Upon locating the first two branch instructions, branch
scanner 58 assumes that the instructions are relative branch
instructions and selects the corresponding encoded target
addresses from the instruction bytes following the start byte
of the branch instruction. For embodiments employing the
x86 instruction set, a nine bit target address (the displacement
byte as well as the corresponding control transfer bit)
is selected, and a 32 bit target address is selected as well.
Furthermore, at least a portion of the opcode byte identified
by the start and control transfer bits is selected. The target
addresses and opcode bytes are routed to prefetch control
unit 50 for use in selecting a target address for fetching from
L0 I-cache 16. The fetch addresses of each branch instruction
(determined from the fetch address of the run section
including each branch instruction and the position of the
branch instruction within the section) are routed to branch
history table 60 for selecting a taken/not-taken prediction
corresponding to each branch instruction. Furthermore, the
fetch addresses corresponding to each branch instruction are
routed to branch select mux 62, which is further routed to

would be selected if the first branch instruction is predicted
taken) and then selecting one of the two predictors based on
the actual prediction selected for the first branch instruction.

Branch history table 60 receives information regarding
the execution of branch instructions from functional units
32A-32D. The history of recent predictions corresponding
to the executed branch instruction as well as the fetch
address of the executed branch instruction are provided for
selecting a predictor to update, as well as the taken/not taken
result of the executed branch instruction. Branch history
table 60 selects the corresponding predictor and updates the
predictor based on the taken/not taken result. In one
embodiment, the branch history table stores a bimodal
counter. The bimodal counter is a saturating counter which
saturates at a minimum and maximum value (i.e. subsequent
decrements of the minimum value and increments of the maximum
value cause no change in the counter). Each time a
branch instruction is taken, the corresponding counter is
incremented and each time a branch instruction is not taken,
the corresponding counter is decremented. The most significant
bit of the counter indicates the taken/not taken prediction
(e.g. taken if set, not taken if clear). In one embodiment,
branch history table 60 stores 64K predictors and maintains
a history of the 16 most recent predictions. Each clock cycle,
the predictions selected during the clock cycle are shifted
into the history and the oldest predictions are shifted out of
the history.

Return stack 64 is used to store the return addresses
corresponding to detected subroutine call instructions.
Return stack 64 receives the fetch address of a subroutine
call instruction from branch scanner 58. The address of the

indirect address cache 66. The target address of each branch byte following the call instruction (calculated from the fetch 

instruction is routed to forward collapse unit 68. According address provided to return stack 64) is placed at the top of 

to one embodiment, branch scanner 58 is configured to scan return stack 64. Return stack 64 provides the address stored 

each run section in parallel for the first two branch instruc- at the top of the return stack to prefetch control unit 50 for 

tions and then to combine the scan results to select the first 35 selection as a target address if a return instruction is detected 

two branch instructions within the run. by branch scanner 58 and prefetch control unit 50. In this 

Branch scanner 58 may further be configured to determine manner, each return instruction receives as a target address 

if a subroutine call instruction is scanned during a clock the address corresponding to the most recently detected call 

cycle. Branch scanner 58 may forward the fetch address of instruction. Generally in the x86 instruction set, a call 

the next instruction following the detected subroutine call 40 instruction is a control transfer instruction which specifies 

instruction to return stack 64 for storage therein. that the sequential address to the call instruction be placed 

In one embodiment, if there arc more than two branch on the stack defined by the x86 architecture. A return 

instructions within a run, the run is scanned again during a instruction is an instruction which selects the target address 

subsequent clock cycle to identify the subsequent branch from the top of the stack. Generally, call and return instruc- 

instruction. 45 tions are used to enter and exit subroutines within a code 

The fetch addresses of the identified branch instructions sequence (respectively). By placing addresses correspond- 

are provided to branch history table 60 to determine a ing to call instructions in return stack 64 and using the 

taken/not taken prediction for each instruction. Branch his- address at the top of return stack 64 as the target address of 

tory table 60 comprises a plurality of taken/not-taken pre- return instructions, the target address of the return instruc- 

dictors corresponding to the previously detected behavior of 50 tion may be correctly predicted. In one embodiment, return 

branch instructions. One of the predictors is selected by stack 64 ma Y comprise 16 entries. 

maintaining a history of the most recent predictions and Indirect address cache 66 stores target addresses corre- 
exclusive ORing those most recent predictions with a por- sponding to previous executions of indirect branch instruc- 
tion of the fetch addresses corresponding to the branch tions. The fetch address corresponding to an indirect branch 
instructions. The least recent (oldest) prediction is exclusive 55 instruction and the target address corresponding to execution 
ORed with the most significant bit within the portion of the of the indirect branch instruction are provided by functional 
fetch address, and so forth through the most recent predic- units 32A-32D to indirect address cache 66. Indirect address 
tion being exclusive ORed with the least significant bit cache 66 stores the target addresses indexed by the corre- 
within the portion of the fetch address. Since two predictors sponding fetch addresses. Indirect address cache 66 receives 
are selected per clock cycle, the predictor corresponding to 60 the fetch address selected by branch select mux 62 
the second branch instruction is dependent upon the predic- (responsive to detection of an indirect branch instruction) 
tion of the first branch instruction (for exclusive ORing with and, if the fetch address is a hit in indirect address cache 66, 
the least significant bit of the corresponding fetch address). provides the corresponding target address to prefetch control 
Branch history table 60 provides the second predictor by unit 50. In one embodiment, indirect address cache 66 may 
selecting both of the predictors which might be selected (i.e. 65 comprise 32 entries. 

the predictor that would be selected if the first branch According to one contemplated embodiment, if indirect 

instruction is predicted not-taken and the predictor that address cache 66 detects a miss for a fetch address, indirect 
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address cache 66 may be configured to select a target address 
to provide from one of the entries. In this manner, a "guess" 
at a branch target is provided in case an indirect branch 
instruction is decoded. Fetching from the guess may be 
performed rather than awaiting the address via execution of 
the indirect branch instruction. Alternatively, another con- 
templated embodiment awaits the address provided via 
execution of the indirect branch instruction. 
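
By way of illustration only, the predictor selection and update described above for branch history table 60 may be modeled in software as follows. The sketch is not part of the described embodiment. The 64K table size and 16 prediction history come from the text; the two bit counter width, the initial counter value, the use of the low 16 bits of the fetch address, and all names are assumptions made for the example.

```python
HISTORY_BITS = 16            # from the text: 16 most recent predictions
TABLE_SIZE = 1 << 16         # from the text: 64K predictors
COUNTER_BITS = 2             # assumption: the text gives no counter width
COUNTER_MAX = (1 << COUNTER_BITS) - 1


class BranchHistoryTable:
    """Software model of the XOR-indexed table of saturating counters."""

    def __init__(self):
        # Assumption: counters start at the minimum (not taken) value.
        self.counters = [0] * TABLE_SIZE
        self.history = 0     # bit 0 = most recent prediction, bit 15 = oldest

    def index(self, fetch_address, history=None):
        # Oldest prediction lines up with the most significant bit of the
        # address portion, most recent prediction with the least significant.
        h = self.history if history is None else history
        return (fetch_address ^ h) & (TABLE_SIZE - 1)

    def predict(self, fetch_address, history=None):
        counter = self.counters[self.index(fetch_address, history)]
        # The most significant counter bit is the prediction.
        return ((counter >> (COUNTER_BITS - 1)) & 1) == 1

    def shift_history(self, taken):
        # Newest prediction shifts in; the oldest falls off the top.
        mask = (1 << HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

    def update(self, fetch_address, history, taken):
        # Called with the history that was in effect when the branch was
        # predicted, plus the executed taken/not taken result.
        i = self.index(fetch_address, history)
        if taken:
            self.counters[i] = min(COUNTER_MAX, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Passing the history explicitly to `update` mirrors the text, in which the history corresponding to the executed branch is supplied along with its fetch address.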

According to one embodiment, prefetch control unit 50 
selects the target address for fetching from L0 I-cache 16 
from: (i) the first encoded target address corresponding to
the first branch instruction identified by branch scanner 58;
(ii) the return stack address provided by return stack 64; and
(iii) a sequential address. Prefetch control unit 50 selects the
first encoded target address if a decode of the opcode 
corresponding to the first instruction indicates that the 
instruction may be a relative branch instruction. If the 
decode indicates that the instruction may be a return 
instruction, then the return stack address is selected. 
Otherwise, the sequential address is selected. Indirect target 
addresses and 32 bit relative target addresses are prefetched 
from L1 I-cache 14. Since these types of target addresses are
often used when the target address is not near the branch
instruction within memory, these types of target addresses
are less likely to hit in L0 I-cache 16. Additionally, if the
second branch instruction is predicted taken and the first 
branch instruction is predicted not taken or the first branch 
instruction is a forward branch which does not eliminate the 
second branch instruction in the instruction run, the second 
target address corresponding to the second branch prediction 
may be used as the target fetch address during the succeed- 
ing clock cycle according to one embodiment. 
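
As a non-limiting sketch, the three-way target selection performed by prefetch control unit 50, together with a return stack of the kind described for return stack 64, might be modeled as shown below. The class and function names, the string values used for the opcode decode result, popping on a return, and the overflow behavior (oldest entry discarded) are assumptions for the example; the 16 entry depth is from the text.

```python
from collections import deque


class ReturnStack:
    """Model of a fixed-depth return address stack."""

    def __init__(self, entries=16):
        # Assumption: when full, the oldest entry is silently dropped.
        self._stack = deque(maxlen=entries)

    def on_call(self, call_fetch_address, call_length):
        # Store the address of the byte following the call instruction.
        self._stack.append(call_fetch_address + call_length)

    def top(self):
        return self._stack[-1] if self._stack else None

    def on_return(self):
        # Assumption: a detected return consumes the top entry.
        return self._stack.pop() if self._stack else None


def select_fetch_target(opcode_kind, encoded_target, return_address,
                        sequential_address):
    """Pick the next fetch address from the three candidates."""
    if opcode_kind == "relative_branch":
        return encoded_target
    if opcode_kind == "return":
        return return_address
    return sequential_address
```

For example, after `on_call(0x1000, 5)` the top of the stack holds `0x1005`, which is the address chosen when the next decoded opcode may be a return.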

It is noted that, if an encoded target address is selected, the 
actual target address may be presented to L0 I-cache 16. 
Prefetch control unit 50 may be configured to precalculate 
each of the possible above/below target addresses and select 
the correct address based on the encoded target address. 
Alternatively, prefetch control unit 50 may record which L0 
I-cache storage locations are storing the above and below 
cache lines, and select the storage locations directly without 
a tag compare. 

Forward collapse unit 68 receives the target addresses and 
positions within the instruction run of each selected branch 
instruction as well as the taken/not taken predictions. For- 
ward collapse unit 68 determines which instructions within 
the run should be cancelled based upon the received pre- 
dictions. If the first branch instruction is predicted taken and 
is backward (i.e. the displacement is negative), all instruc- 
tions subsequent to the first branch instruction are cancelled. 
If the first branch instruction is predicted taken and is 
forward but the displacement is small (e.g. within the 
instruction run), the instructions which are between the first 
branch instruction and the target address are cancelled. The 
second branch instruction, if still within the run according to 
the first branch instruction's prediction, is treated similarly. 
Cancel indications for the instructions within the run are sent
to instruction queue 20.
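
The cancellation rules applied by forward collapse unit 68 may be illustrated by the following sketch, which is offered as an example only. Instruction positions are byte offsets within the run. The function name, the tuple layout, and the handling of a taken forward branch whose target lies beyond the run (treated here like a backward branch) are assumptions not stated in the text.

```python
def cancelled_instructions(starts, branches, run_end):
    """Return the set of instruction start offsets to cancel.

    starts   -- sorted start offsets of the instructions in the run
    branches -- up to two (offset, predicted_taken, target_offset) tuples in
                program order; target_offset is relative to the run start
                and may be negative or past run_end
    run_end  -- offset one past the last byte of the run
    """
    cancelled = set()
    for offset, taken, target in branches:
        # A branch that was itself cancelled, or predicted not taken,
        # cancels nothing.
        if offset in cancelled or not taken:
            continue
        if target <= offset or target >= run_end:
            # Backward branch (or, by assumption, a target outside the run):
            # everything after the branch is cancelled.
            cancelled.update(s for s in starts if s > offset)
        else:
            # Small forward displacement: cancel only the skipped instructions.
            cancelled.update(s for s in starts if offset < s < target)
    return cancelled
```

With instructions starting at offsets 0, 2, 5, 7, 10 and 12, a taken branch at offset 2 targeting offset 10 cancels the instructions at 5 and 7, so a second branch at offset 7 is no longer within the run.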

Prefetch control unit 50 may be further configured to 
select a cache line within L0 I-cache 16 for replacement by 
a cache line provided from L1 I-cache 14. In one
embodiment, prefetch control unit 50 may use a least 
recently used (LRU) replacement algorithm. 

Turning now to FIG. 3, a block diagram of one embodiment of
lookahead/collapse unit 24 is shown. Other embodiments are
possible and contemplated. As shown in FIG. 3,
lookahead/collapse unit 24 includes a plurality of decode
units 70A-70F, an ESP/EBP lookahead unit 72, a lookahead
address/result calculation unit 74, a dispatch control unit 76,
and an operand collapse unit 78. Decode units 70A-70F are
coupled to receive instructions from alignment unit 22.
Decode units 70A-70F are coupled to provide decoded
instructions to FPU/multimedia unit 40, ESP/EBP lookahead
unit 72, future file 26, and lookahead address/result
calculation unit 74. ESP/EBP lookahead unit 72 is coupled
to lookahead address/result calculation unit 74, as is future
file 26. Lookahead address/result calculation unit 74 is
further coupled to load/store unit 36 and dispatch control unit
76. Dispatch control unit 76 is further coupled to operand
collapse unit 78, future file 26, load/store unit 36, and
reorder buffer 28. Operand collapse unit 78 is coupled to
instruction windows 30.

Each decode unit 70A-70F forms an issue position to
which alignment unit 22 aligns an instruction. While not
indicated specifically throughout FIG. 3 for simplicity in the
drawing, a particular instruction remains within its issue
position as the instruction moves through lookahead/collapse
unit 24 and is routed to one of instruction windows
30A-30B if not completed within lookahead/collapse unit
24.

Decode units 70A-70F route FPU/multimedia instructions
to FPU/multimedia unit 40. However, if the FPU/
multimedia instructions include memory operands, memory
operations are also dispatched to load/store unit 36 in
response to the instruction through lookahead address/result
calculation unit 74. Additionally, if the address for the
memory operations cannot be generated by lookahead
address/result calculation unit 74, an address generation
operation is dispatched to one of address generation units
34A-34D via instruction windows 30A-30B. Still further,
entries within reorder buffer 28 are allocated to the FPU/
multimedia instructions for maintenance of program order.
Generally, entries within reorder buffer 28 are allocated from
decode units 70A-70F for each instruction received therein.

Each of decode units 70A-70F is further configured to
determine: (i) whether or not the instruction uses the ESP or
EBP registers as a source operand; and (ii) whether or not the
instruction modifies the ESP/EBP registers (i.e. has the ESP
or EBP registers as a destination operand). Indications of
these determinations are provided by decode units 70A-70F
to ESP/EBP lookahead unit 72. ESP/EBP lookahead unit 72
generates lookahead information for each instruction which
uses the ESP or EBP registers as a source operand. The
lookahead information may include a constant to be added
to the current lookahead value of the corresponding register
and an indication of a dependency upon an instruction in a
prior issue position. In one embodiment, ESP/EBP lookahead
unit 72 is configured to provide lookahead information
as long as the set of concurrently decoded instructions
provided by decode units 70A-70F does not include more
than: (i) two push operations (which decrement the ESP
register by a constant value); (ii) two pop operations (which
increment the ESP register by a constant value); (iii) one move
to the ESP register; (iv) one arithmetic/logical instruction
having the ESP as a destination; or (v) three instructions which
update ESP. If one of these restrictions is exceeded, ESP/
EBP lookahead unit 72 is configured to stall instructions
beyond those which do not exceed restrictions until the
succeeding clock cycle (a "split line" case). For those
instructions preceded, in the same clock cycle but in earlier
issue positions, by instructions which increment or decrement
the ESP register, ESP/EBP lookahead unit 72 generates
a constant indicating the combined total modification to the
ESP register of the preceding instructions. For those
instructions preceded by a move or arithmetic operation upon the
ESP or EBP registers, ESP/EBP lookahead unit 72 generates
a value identifying the issue position containing the move or
arithmetic instruction.
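
For purposes of illustration, the per-issue-position lookahead information generated by ESP/EBP lookahead unit 72 may be sketched as follows. The sketch tracks ESP only and omits the restriction and stall checks. The four byte operand size, the operation labels, and the choice to restart the constant at zero after a move or arithmetic update of ESP are assumptions made for the example.

```python
PUSH_DELTA = -4   # assumption: 32 bit operand size
POP_DELTA = 4


def esp_lookahead(ops):
    """Return one (constant, depends_on) pair per issue position.

    ops        -- operation kinds in issue order: "push", "pop",
                  "mov_esp" (move or arithmetic with ESP as destination),
                  or "other"
    constant   -- combined ESP change from earlier push/pop operations
    depends_on -- issue position of the latest earlier "mov_esp", or None
    """
    info = []
    constant = 0
    depends_on = None
    for position, op in enumerate(ops):
        # The information reflects only instructions in earlier positions.
        info.append((constant, depends_on))
        if op == "push":
            constant += PUSH_DELTA
        elif op == "pop":
            constant += POP_DELTA
        elif op == "mov_esp":
            depends_on = position
            constant = 0   # assumption: constant restarts from the new ESP
    return info
```

For two pushes followed by a pop, the third issue position receives a constant of -8 and the position after it a constant of -4.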

The lookahead values may be used by lookahead address/ 
result calculation unit 74 to generate either a lookahead 
address corresponding to the instruction within the issue 
position (thereby inhibiting an address generation operation 
which would otherwise be performed by one of address 
generation units 34A-34D) or a lookahead result corre- 
sponding to the instruction (thereby providing lookahead 
state to future file 26 earlier in the pipeline). Performance 
may be increased by removing address generation opera- 
tions and/or providing lookahead state prior to functional 
units 32A-32D and address generation units 34A-34D. 
Many x86 code sequences include a large number of rela- 
tively simple operations such as moves of values from a 
source to destination without arithmetic/logical operation or 
simple arithmetic operations such as add/subtract by small
constant or increment/decrement of a register operand. 
Accordingly, functional units 32A-32D may typically 
execute the more complex arithmetic/logical operations and 
branch instructions and address generation units 34A-34D 
may typically perform the more complex address genera- 
tions. Instruction throughput may thereby be increased. 

Decode units 70A-70F are still further configured to 
identify immediate data fields from the instructions decoded 
therein. The immediate data is routed to lookahead address/ 
result calculation unit 74 by decode units 70A-70F. 
Additionally, decode units 70A-70F are configured to identify
register operands used by the instructions and to route
register operand requests to future file 26. Future file 26
returns corresponding speculative register values or result
queue tags for each register operand. Decode units 70 further
provide dependency checking between the line of instructions
to ensure that an instruction which uses a result of an
instruction within a different issue position receives a tag 
corresponding to that issue position. 

Lookahead address/result calculation unit 74 receives the 
lookahead values from ESP/EBP lookahead unit 72, the
immediate data from decode units 70A-70F, and the specu- 
lative register values or result queue tags from future file 26. 
Lookahead address/result calculation unit 74 attempts to 
generate either a lookahead address corresponding to a 
memory operand of the instruction, or a lookahead result if 
the instruction does not include a memory operand. For 
example, simple move operations can be completed (with 
respect to functional units 32 and address generation units 
34) if an address generation can be performed by lookahead 
address/result calculation unit 74. In one embodiment,
lookahead address/result calculation unit 74 is configured to
compute addresses using displacement only, register plus
displacement, ESP/EBP plus displacement, and scale-index-base
addressing modes, except for index or base registers
being ESP/EBP. Load/store unit 36 performs the memory
operation and returns the memory operation results via result 
buses 48. Even if no address is generated for a memory 
operation by lookahead address/result calculation unit 74, 
lookahead address/result calculation unit 74 indicates the 
memory operation and corresponding result queue tag to 
load/store unit 36 to allocate storage within load/store unit 
36 for the memory operation. 

Simple arithmetic operations which increment or decrement
a source operand, add/subtract a small immediate
value to a source operand, or add/subtract two register
source operands may also be completed via lookahead
address/result calculation unit 74 if the source operands are
available from future file 26 (i.e. a speculative register value
is received instead of a result queue tag). Instructions
completed by lookahead address/result calculation unit 74
are indicated as completed and are allocated entries in
reorder buffer 28 but are not dispatched to instruction
windows 30. Lookahead address/result calculation unit 74
may comprise, for example, an adder for each issue position
along with corresponding control logic for selecting among
the lookahead values, immediate data, and speculative register
values. It is noted that simple arithmetic operations may
still be forwarded to instruction windows 30 for generation
of condition flags, according to the present embodiment.
However, generating the functional result in lookahead
address/result calculation unit 74 provides the lookahead
state early, allowing subsequent address generations/
instructions to be performed early as well.

Lookahead address/result calculation unit 74 may be
configured to keep separate lookahead copies of the ESP/
EBP registers in addition to the future file copies. However,
if updates to the ESP/EBP are detected which cannot be
calculated by lookahead address/result calculation unit 74,
subsequent instructions may be stalled until a new lookahead
copy of the ESP/EBP can be provided from future file
26 (after execution of the instruction which updates ESP/
EBP in the undeterminable manner).

Dispatch control unit 76 determines whether or not a
group of instructions is dispatched to provide pipeline flow
control. Dispatch control unit 76 receives instruction counts
from instruction windows 30 and load/store counts from
load/store unit 36 and, assuming the maximum possible
number of instructions are in flight in pipeline stages
between dispatch control unit 76 and instruction windows
30 and load/store unit 36, determines whether or not space
will be available for storing the instructions to be dispatched
within instruction windows 30 and/or load/store unit 36
when the instructions arrive therein. If dispatch control unit
76 determines that insufficient space will be available in
load/store unit 36 and either instruction window 30, dispatch
is stalled until the instruction counts received by dispatch
control unit 76 decrease to a sufficiently low value.
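
The conservative space check made by dispatch control unit 76 amounts to simple arithmetic, sketched below as an example only. The function names, the tuple layout, and the policy of stalling when any one destination lacks space are assumptions; the text gives no capacities or pipeline depths, so the numbers in the usage note are purely hypothetical.

```python
def will_fit(reported_count, capacity, max_in_flight, arriving):
    """True if the arriving instructions are guaranteed a storage location.

    reported_count -- occupancy last reported by the window or load/store unit
    capacity       -- total storage locations in that unit
    max_in_flight  -- worst-case number of already dispatched instructions
                      still in the pipeline stages ahead of the unit
    arriving       -- instructions in the group bound for that unit
    """
    return reported_count + max_in_flight + arriving <= capacity


def may_dispatch(destinations):
    """Assumption: the group is released only if every destination fits."""
    return all(will_fit(*d) for d in destinations)
```

With a hypothetical 16 entry window reporting 10 occupied entries and at most 4 instructions in flight, a group of 2 may be dispatched, whereas a reported count of 11 would stall dispatch.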

Upon releasing instructions for dispatch through dispatch
control unit 76, future file 26 and reorder buffer 28 are
updated with speculatively generated lookahead results. In
one embodiment, the number of non-ESP/EBP updates
supported may be limited to, for example, two in order to
limit the number of ports on future file 26. Furthermore,
operand collapse unit 78 collapses speculatively generated
lookahead results into subsequent, concurrently decoded
instructions which depend upon those results as indicated by
the previously determined intraline dependencies. In this
manner, the dependent instructions receive the speculatively
generated lookahead results since these results will
not subsequently be forwarded from functional units
32A-32D. Those instructions not completed by lookahead
address/result calculation unit 74 are then transmitted to one
of instruction windows 30A-30B based upon the issue
position to which those instructions were aligned by alignment
unit 22.

It is noted that certain embodiments of processor 10 may
employ a microcode unit (not shown) for executing complex
instructions by dispatching a plurality of simpler instructions
referred to as a microcode routine. Decode units 70A-70F
may be configured to detect which instructions are microcode
instructions and to route the microcode instructions to
the microcode unit. For example, the absence of a directly
decoded instruction output from a decode unit 70 which
received a valid instruction may be an indication to the
microcode unit to begin execution for the corresponding
valid instruction. It is further noted that various storage
devices are shown in FIGS. 2 and 3 (e.g. devices 79A, 79B,
and similar devices in FIG. 2 and devices 79C, 79D and
similar devices in FIG. 3). The storage devices represent
latches, registers, flip-flops and the like which may be used
to separate pipeline stages. However, the particular pipeline
stages shown in FIGS. 2 and 3 are but one embodiment of
suitable pipeline stages for one embodiment of processor 10.
Other pipeline stages may be employed in other embodiments.

It is noted that, while the x86 instruction set and
architecture has been used as an example above and may be used
as an example below, any instruction set and architecture
may be used. Additionally, displacements may be any desirable
size (in addition to the 8 bit and 32 bit sizes used as
examples herein). Furthermore, while cache line fetching
may be described herein, it is noted that cache lines may be
sectors, and sectors may be fetched if desirable based upon
cache line size and the number of bytes desired to be fetched.

Turning now to FIG. 4, a block diagram of one embodiment
of predecode unit 12 is shown. Other embodiments are
possible and contemplated. As shown in FIG. 4, predecode
unit 12 includes an input instruction bytes register 80, a fetch
address register 82, a byte predecoder 84, a control unit 86,
a target generator 88, a start and control transfer bits register
90, an output instruction bytes register 92, and a byte select
mux 94. Input instruction bytes register 80 is coupled to byte
predecoder 84, control unit 86, target generator 88, byte
select mux 94, and external interface unit 42. Fetch address
register 82 is coupled to L1 I-cache 14 and target generator
88. Byte predecoder 84 is coupled to start and control
transfer bits register 90 and control unit 86. Control unit 86
is coupled to L1 I-cache 14, byte select mux 94, and target
generator 88. Target generator 88 is coupled to byte select
mux 94, which is further coupled to output instruction bytes
register 92. Output instruction bytes register 92 and start and
control transfer bits register 90 are further coupled to L1
I-cache 14.

Upon detection of an L1 I-cache miss, predecode unit 12
receives the linear fetch address corresponding to the miss
into fetch address register 82. In parallel, external interface
unit 42 receives the corresponding physical fetch address
and initiates an external fetch for the cache line identified by
the fetch address. External interface unit 42 provides the
received instruction bytes to input instruction bytes register
80.

Byte predecoder 84 predecodes the received instruction
bytes to generate corresponding start and control transfer
predecode bits. The generated predecode information is
stored into start and control transfer bits register 90. Because
instructions can have boundaries at any byte within the
cache line due to the variable length nature of the x86
instruction set, byte predecoder 84 begins predecoding at the
offset within the cache line specified by the fetch address
stored within fetch address register 82. The byte specified by
the offset is assumed to be the first byte of an instruction (i.e.
the corresponding start bit is set). Byte predecoder 84
predecodes each byte beginning with the first byte to determine
the beginning of each instruction and to detect branch
instructions. Branch instructions result in the control transfer
bit corresponding to the start byte of the branch instruction
being set by byte predecoder 84. Additionally, byte predecoder
84 informs control unit 86 if the branch instruction is
a relative branch instruction and indicates the position of the
instruction subsequent to the branch instruction within the
cache line. In one embodiment, byte predecoder 84 is
configured to predecode four bytes per clock cycle in
parallel.

Responsive to the signal from byte predecoder 84 indicating
that a relative branch instruction has been detected,
control unit 86 causes target generator 88 to generate the
target address corresponding to the relative branch instruction.
The displacement byte or bytes are selected from the
instruction bytes stored in register 80. Additionally, the fetch
address stored in fetch address register 82 (with the offset
portion replaced by the position of the instruction subsequent
to the branch instruction) is provided to target generator 88.
Target generator 88 adds the received address and
displacement byte or bytes, thereby generating the target
address. The generated target address is then encoded for
storage as a replacement for the displacement field of the
relative branch instruction. Additionally, control unit 86
selects the output of target generator 88 to be stored into
output instruction bytes register 92 instead of the corresponding
displacement bytes of the relative branch instruction
from input instruction bytes register 80. Other instruction
bytes are selected from input instruction bytes register
80 for storage in output instruction bytes register 92 as those
bytes are predecoded by byte predecoder 84. Once byte
predecoder 84 has completed predecode of the cache line
and each relative branch instruction has had its displacement
replaced by an encoding of the target address, control unit 86
asserts a predecode complete signal to L1 I-cache 14, which
then stores the output instruction bytes and corresponding
start and control transfer bits.

As described above, for relative branch instructions having
small displacement fields (e.g. a single displacement
byte) the control transfer bit corresponding to the displacement
byte is used in addition to the displacement byte to
store the encoding of the target address. Target generator 88
signals byte predecoder 84 with the appropriate control
transfer bit, which byte predecoder 84 stores in the
corresponding position within start and control transfer bits
register 90.

It is noted that, if a relative branch instruction spans the
boundary between two cache lines (i.e. a first cache line
stores a first portion of the instruction and the succeeding
cache line stores the remaining portion), predecode unit 12
may be configured to fetch the succeeding cache line in
order to complete the predecoding for the relative branch
instruction. It is further noted that predecode unit 12 may be
configured to handle multiple outstanding cache lines
simultaneously.

Turning next to FIG. 4A, a block diagram of one embodiment
of target generator 88 is shown. Other embodiments
are possible and contemplated. As shown in FIG. 4A, target
generator 88 includes a displacement mux 100, a sign extend
block 102, an adder 104, and a displacement encoder 106.
Displacement mux 100 is coupled to input instruction bytes
register 80 and sign extend block 102, and receives control
signals from control unit 86. Sign extend block 102 is
coupled to an input of adder 104 and receives control signals
from control unit 86. The second input of adder 104 is
coupled to receive the fetch address from fetch address
register 82 (except for the offset bits) concatenated with a
position within the cache line from control unit 86. Adder
104 is further coupled to displacement encoder 106 which
receives control signals from control unit 86. Displacement
encoder 106 is further coupled to byte select mux 94 and
byte predecoder 84.

Displacement mux 100 is used to select a displacement
byte or bytes from the relative branch instruction. In the
present embodiment, displacements may be one or four
bytes. Accordingly, displacement mux 100 selects four bytes
from input instruction bytes register 80. If a one byte
displacement is included in the relative branch instruction,
the displacement is selected into the least significant of the 
four bytes. The remaining three bytes may be zeros or may 
be prior bytes within input instruction bytes register 80. Sign 
extend block 102, under control from control unit 86, sign 
extends the one byte displacement to a four byte value. On 
the other hand, a four byte displacement is selected by 
displacement mux 100 and is not modified by sign extend 
block 102. It is noted that larger addresses may be employed 
by processor 10. Generally, the displacement may be sign 
extended to the number of bits within the address. 
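
As an illustrative sketch only, the target address calculation performed by sign extend block 102 and adder 104 may be expressed as follows. The 32 bit address width and 64 byte cache line are taken from the present embodiment; the function names and the assumption that the position of the next instruction lies within the same cache line are introduced for the example.

```python
ADDRESS_BITS = 32
LINE_BYTES = 64


def sign_extend(value, from_bits, to_bits=ADDRESS_BITS):
    """Sign extend a from_bits-wide value to to_bits (two's complement)."""
    sign = 1 << (from_bits - 1)
    value &= (1 << from_bits) - 1
    return ((value ^ sign) - sign) & ((1 << to_bits) - 1)


def branch_target(fetch_address, next_instr_offset, displacement, disp_bytes):
    """Add the displacement to the address of the instruction after the branch.

    fetch_address     -- address held in the fetch address register
    next_instr_offset -- position within the cache line of the instruction
                         subsequent to the branch (assumed < LINE_BYTES)
    displacement      -- raw displacement field (one or four bytes)
    disp_bytes        -- size of the displacement field in bytes
    """
    # Replace the offset bits of the fetch address with the position of
    # the next instruction, then add the sign extended displacement.
    base = (fetch_address & ~(LINE_BYTES - 1)) | next_instr_offset
    extended = sign_extend(displacement, 8 * disp_bytes)
    return (base + extended) & ((1 << ADDRESS_BITS) - 1)
```

For example, a one byte displacement of `0xFE` (that is, -2) on a branch whose successor begins at offset `0x12` of the line at `0x10000000` yields the target `0x10000010`.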

Displacement encoder 106 receives the target address 
calculated by adder 104 and encodes the target address into 
a format storable into the displacement bytes. In the present 
embodiment, a four byte displacement stores the entirety of 
the target address. Hence, displacement encoder 106 passes 
the target address unmodified to byte select mux 94 for 
storage in output instruction bytes register 92. Additionally, 
the control transfer bits corresponding to the displacement 
bytes are not used. For one byte displacements, the target
address is encoded. More particularly, a portion of the 
displacement byte is used to store the offset of the target 
address within the target cache line (e.g. in the present 
embodiment, 6 bits to store a 64 byte offset). The remaining 
portion of the displacement byte and the corresponding 
control transfer bit is encoded with a value indicating the 
target cache line as a number of cache lines above or below 
the cache line identified by the fetch address stored in fetch 
address register 82. Accordingly, displacement encoder 106 
is coupled to receive the fetch address from fetch address 
register 82. Displacement encoder 106 compares the fetch 
address to the target address to determine not only the 
number of cache lines therebetween, but the direction. Upon 
generating the encoding, displacement encoder 106 trans- 
mits the modified displacement byte to byte select mux 94 
for storage in output instruction bytes register 92 and also 
transmits the value for the control transfer bit corresponding 
to the displacement byte to byte predecoder 84. 

As an alternative to employing adder 104 to calculate 
target addresses for small displacement fields, displacement 
encoder 106 may directly generate the encoded target
address (above/below value and cache line offset) by examining
the value of the displacement field and the position of
the branch instruction within the cache line. 

Turning now to FIG. 5, a diagram illustrating an exem- 
plary relative branch instruction 110 having an eight bit 
displacement according to the x86 instruction set is shown. 
Relative branch instruction 110 includes two bytes, an 
opcode byte 112 which is also the first byte of the instruction 
and a displacement byte 114. Opcode byte 112 specifies that 
instruction 110 is a relative branch instruction and that the 
instruction has an eight bit displacement. Displacement byte 
114 has been updated with an encoding of the target address. 
The encoding includes a cache line offset portion labeled
"CL offset" (which comprises six bits in the current embodiment
but may comprise any number of bits suitable for the
corresponding instruction cache line size) and a relative
cache line portion labeled "LI2" in the control transfer bit 
corresponding to displacement byte 114 and "LI1 LI0" 
within displacement byte 114. 
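
The packing of an encoded target address into displacement byte 114
and its control transfer bit may be sketched as follows. The bit
positions and the numeric code used for the relative cache line are
assumptions made for illustration: the six bit CL offset is placed
in the low bits of the byte, "LI1 LI0" in the two high bits, "LI2" in
the control transfer bit, and the three bit field is treated as a
two's complement count of cache lines above or below the fetch
address. The function names are hypothetical.

```python
from typing import Tuple

LINE_BYTES = 64   # cache line size in the present embodiment
OFFSET_BITS = 6   # bits needed for a 64 byte offset

def encode_target(fetch_addr: int, target_addr: int) -> Tuple[int, int]:
    """Return (displacement_byte, control_transfer_bit) for a short branch."""
    rel_line = target_addr // LINE_BYTES - fetch_addr // LINE_BYTES
    if not -4 <= rel_line <= 3:
        raise ValueError("target too far for the assumed three bit line field")
    li = rel_line & 0b111                     # assumed two's complement LI2..LI0
    offset = target_addr % LINE_BYTES         # "CL offset"
    disp_byte = ((li & 0b011) << OFFSET_BITS) | offset   # LI1 LI0 plus offset
    ct_bit = li >> 2                                      # LI2
    return disp_byte, ct_bit

def decode_target(fetch_addr: int, disp_byte: int, ct_bit: int) -> int:
    """Rebuild the target address from the encoded byte and control transfer bit."""
    li = (ct_bit << 2) | (disp_byte >> OFFSET_BITS)
    rel_line = li - 8 if li & 0b100 else li
    line = fetch_addr // LINE_BYTES + rel_line
    return line * LINE_BYTES + (disp_byte & (LINE_BYTES - 1))
```

Under these assumptions, decoding requires only the fetch address of
the line containing the branch and no addition across the full
address width.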

FIG. 5 also illustrates the start and control transfer bits 
corresponding to instruction 110. The start bit for each byte 
is labeled "S" in FIG. 5 with a box indicating the value of 
the bit, and the control transfer bit is labeled "C" with a box 
indicating the value of the bit. Accordingly, the start bit 
corresponding to opcode byte 112 is set to indicate that 
opcode byte 112 is the beginning of an instruction and the
control transfer bit corresponding to opcode byte 112 is also
set to indicate that the instruction beginning at opcode byte
112 is a control transfer instruction. The start bit corresponding
to displacement byte 114, on the other hand, is clear
because displacement byte 114 is not the beginning of an
instruction. The control transfer bit corresponding to displacement
byte 114 is used to store a portion of the relative
cache line portion of the encoded target address.

Turning next to FIG. 6, an exemplary relative branch
instruction 120 having a 32-bit displacement according to
the x86 instruction set is shown. Instruction 120 includes an
opcode field 122 comprising two bytes and a displacement
field 124 comprising four bytes. Similar to FIG. 5, FIG. 6
illustrates the start and control transfer bits for each byte
within instruction 120. Accordingly, two start bits and two
control transfer bits are illustrated for opcode field 122, and
one start bit and control transfer bit are illustrated for each
byte within displacement field 124.

The first start bit corresponding to opcode field 122 (i.e.
the start bit corresponding to the first byte of opcode field
122) is set, indicating that the first byte of opcode field 122
is the beginning of an instruction. The first control transfer
bit corresponding to opcode field 122 is also set, indicating
that instruction 120 is a control transfer instruction. The
second start bit corresponding to opcode field 122 is clear,
as the second byte within opcode field 122 is not the start of
an instruction. The control transfer bit corresponding to the
second opcode byte is a don't care (indicated by an "x").

Since displacement field 124 is large enough to contain 
the entirety of the target address corresponding to instruction 
120, the control transfer bits corresponding to the displacement
bytes are also don't cares. Each start bit corresponding
to a displacement byte is clear, indicating that these bytes
are not the start of an instruction.

Turning now to FIG. 7, a diagram of an exemplary set of
instructions 130 from the x86 instruction set is shown,
further illustrating use of the start and control transfer bits
according to one embodiment of processor 10. Similar to
FIGS. 5 and 6, each byte within the set of instructions 130
is illustrated along with a corresponding start bit and control
transfer bit.

The first instruction within set of instructions 130 is an
add instruction which specifies addition of a one byte
immediate field to the contents of the AL register and storing
the result in the AL register. The add instruction is a two byte
instruction in which the first byte is the opcode byte and the
second byte is the one byte immediate field. Accordingly, the
opcode byte is marked with a set start bit indicating the
beginning of the instruction. The corresponding control
transfer bit is clear indicating that the add instruction is not
a branch instruction. The start bit corresponding to the
immediate byte is clear because the immediate byte is not
the start of an instruction, and the control transfer bit is a
don't care.

Subsequent to the add instruction is a single byte instruction
(an increment of the EAX register). The start bit
corresponding to the instruction is set because the byte is the
beginning of the instruction. The control transfer bit is clear
since the increment is not a branch instruction.

Finally, a second add instruction specifying the addition
of a one byte immediate field to the contents of the AL
register is shown subsequent to the increment instruction.
The start bit corresponding to the opcode of the add instruction
is set, and the control transfer bit is clear. The increment
instruction followed by the add instruction illustrates that
consecutive bytes can have start bits which are set in the case
where a single byte is both the start boundary and end
boundary of the instruction.
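
The start and control transfer bits described for set of instructions
130 may be reproduced with a short sketch. The helper name
predecode_bits is hypothetical, and the control transfer bit of every
byte other than the first byte of an instruction is modeled as a
don't care (None). That matches the non-branch instructions of
FIG. 7 but not the encoded displacement byte of FIG. 5.

```python
from typing import List, Optional, Tuple

def predecode_bits(lengths: List[int],
                   is_branch: List[bool]) -> List[Tuple[int, Optional[int]]]:
    """Return per-byte (start, control_transfer) pairs; None marks a don't care."""
    bits: List[Tuple[int, Optional[int]]] = []
    for length, branch in zip(lengths, is_branch):
        bits.append((1, 1 if branch else 0))                 # first byte of the instruction
        bits.extend((0, None) for _ in range(length - 1))    # remaining bytes
    return bits

# add AL, imm8 (two bytes); inc EAX (one byte); add AL, imm8 (two bytes)
fig7 = predecode_bits([2, 1, 2], [False, False, False])
```

In the resulting list the third and fourth bytes both have their
start bits set, which corresponds to the single byte increment
instruction followed immediately by the second add instruction.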

Turning now to FIG. 8, a block diagram of one embodi- 
ment of branch scanner 58 is shown for use with the x86 
instruction set. Other embodiments are possible and con- 
templated. In the embodiment of FIG. 8, branch scanner 58 
includes a scan block 140, section target muxes 
142A-142D, and run target muxes 144A-144D. Scan block 
140 is coupled to receive the start and control transfer bits 
corresponding to a run section from select next blocks 52 
through instruction select mux 54. Branch scanner 58 further 
includes additional scan blocks similar to scan block 140 for 
scanning the start and control transfer bits corresponding to 
the remaining run sections of the selected run. Scan block 
140 is coupled to section target muxes 142A-142D to 
provide selection controls thereto. Additionally, scan block 
140 (and similar scan blocks for the other run sections) 
provide selection controls for run target muxes 144A-144D. 
Each of section target muxes 142A-142B is coupled to 
receive the instruction bytes corresponding to the run section 
scanned by scan block 140 as well as the corresponding 
control transfer bits. Each of section target muxes
142C-142D is coupled to receive the instruction bytes corresponding
to the run section as well, but may not receive the
corresponding control transfer bits. Each of section target
muxes 142A-142D is coupled to a respective one of run target
muxes 144A-144D as shown in FIG. 8. The outputs of run
target muxes 144A and 144B are coupled to prefetch control 
unit 50 and to branch history table 60. The outputs of run 
target muxes 144C and 144D are coupled to prefetch control 
unit 50. 

Scan block 140 is configured to scan the start and control 
transfer bits received therein in order to locate the first two 
branch instructions within the run section. If a first branch 
instruction is identified within the run section, scan block 
140 directs section target mux 142A to select the opcode 
byte, which is the byte for which both the start and control 
transfer bits are set, and the immediately succeeding byte 
and the control transfer bit corresponding to the immediately 
succeeding byte, which collectively form the encoded target 
address if the first branch instruction includes an eight bit 
relative displacement. Similarly, if a second branch instruc- 
tion is identified within the run section, scan block 140 
directs section target mux 142B to select the opcode byte of 
the second branch instruction and the immediately succeed- 
ing byte and the control transfer bit corresponding to the 
immediately succeeding byte. In this manner, the opcode 
byte and target address corresponding to the first two relative 
branch instructions having eight bit displacement are 
selected. Additionally, the position of each branch instruc- 
tion within the run section is identified by scan block 140. 
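
The scan performed by scan block 140 may be sketched sequentially
as follows. This is a software illustration under assumed names
(find_first_two_branches, start_bits, ct_bits); the hardware would
be expected to evaluate all byte positions in parallel rather than
in a loop.

```python
from typing import List

def find_first_two_branches(start_bits: List[int], ct_bits: List[int]) -> List[int]:
    """Return byte positions of up to two branch opcodes within a run section."""
    positions: List[int] = []
    for pos, (start, ct) in enumerate(zip(start_bits, ct_bits)):
        if start and ct:            # start and control transfer bits both set
            positions.append(pos)
            if len(positions) == 2:
                break               # only the first two branches are located
    return positions
```

A control transfer bit that is set on a byte whose start bit is clear
is ignored by this test, since such a bit may be carrying part of an
encoded target address rather than marking a branch.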

Scan block 140 is further configured to control section
target mux 142C in response to detecting the first branch
instruction. More particularly, scan block 140 selects the
four consecutive instruction bytes beginning with the second
byte following the start byte of the first branch instruction
(i.e. beginning with the byte two bytes subsequent to the
start byte of the first branch instruction within the cache
line). These consecutive instruction bytes are the encoded
target address if the first branch instruction includes a 32-bit
relative displacement. Similarly, scan block 140 controls
section target mux 142D to select the four consecutive instruction
bytes beginning with the second byte following the start byte
of the second branch instruction. In this manner, the target
addresses corresponding to the first two relative branch
instructions having 32-bit displacements are selected.
Prefetch control unit 50 is configured to determine whether
or not either: (i) the target address selected by section target
mux 142A; (ii) the target address selected by section target
mux 142C; or (iii) a target address from return stack 64 or
indirect address cache 66 corresponds to the first branch
instruction. Similarly, prefetch control unit 50 is configured
to determine whether or not either: (i) the target address
selected by section target mux 142B; (ii) the target address
selected by section target mux 142D; or (iii) a target address
from return stack 64 or indirect address cache 66 corresponds
to the second branch instruction.
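
The two candidate selections made for each located branch may be
sketched as follows. The function name candidate_targets is
hypothetical, and the sketch assumes for simplicity that the branch
and its displacement lie wholly within the run section supplied.

```python
from typing import List, Tuple

def candidate_targets(section: bytes, ct_bits: List[int],
                      pos: int) -> Tuple[Tuple[int, int], bytes]:
    """For a branch opcode at `pos`, return both candidate target encodings."""
    # Selection in the manner of muxes 142A/142B: the immediately succeeding
    # byte together with its control transfer bit (eight bit displacement case).
    short = (section[pos + 1], ct_bits[pos + 1])
    # Selection in the manner of muxes 142C/142D: four bytes beginning with
    # the second byte following the start byte (32-bit displacement case).
    long_bytes = section[pos + 2:pos + 6]
    return short, long_bytes
```

Both candidates are produced without decoding the opcode; which of
them (if either) applies is left to the later determination by
prefetch control unit 50.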

Scan block 140, in conjunction with similar scan blocks
for the other sections of the run, controls run target muxes
144A-144D to select target information corresponding to
the first two branch instructions within the run. Accordingly,
run target mux 144A selects the target address (i.e. the
immediately succeeding byte and corresponding control
transfer bit), opcode, and position of the first branch instruction
within the run. Similarly, run target mux 144B selects
the target address, opcode, and position of the second branch
instruction within the run. Run target muxes 144C-144D
select 32-bit target addresses corresponding to the first and
second branch instructions, respectively.

Turning next to FIG. 9, a block diagram of one embodiment
of prefetch control unit 50 is shown. Other embodiments
are possible and contemplated. As shown in FIG. 9,
prefetch control unit 50 includes a decoder 150, a fetch
address mux 152, an incrementor 154, and an L1 prefetch
control unit 156. Decoder 150 is coupled to receive the first
branch opcode corresponding to the first branch instruction
within the run from branch scanner 58 and to reorder buffer
28 to receive a misprediction redirection indication and
corresponding corrected fetch address. Additionally,
decoder 150 is coupled to fetch address mux 152 and L1
prefetch control unit 156. Fetch address mux 152 is coupled
to receive the first target address corresponding to the first
branch instruction within the run as selected by run target
mux 144A. The second target address corresponding to the
second branch instruction is also provided to fetch
address mux 152 with a one clock cycle delay. Additionally,
fetch address mux 152 is configured to receive the return
address provided by return stack 64, the corrected fetch
address provided by reorder buffer 28 upon misprediction
redirection, and the sequential address to the address fetched
in the previous clock cycle (generated by incrementor 154).
Fetch address mux 152 is coupled to provide the target fetch
address to L0 I-cache 16 and to L1 prefetch control unit 156.
L1 prefetch control unit 156 is further coupled to L0 I-cache
16 to receive a miss indication, to indirect address cache 66
to receive a predicted indirect target address, to branch
scanner 58 to receive 32-bit target addresses corresponding
to relative branch instructions, to reorder buffer 28 to receive
branch misprediction addresses, and to L1 I-cache 14 to
provide an L1 prefetch address. Prefetch control unit 50
provides a sequential fetch address to L0 I-cache 16 via a
register 158.

Decoder 150 is configured to decode the opcode corresponding
to the first identified branch instruction from branch
scanner 58 in order to select the target fetch address for L0
I-cache 16. In order to provide the target fetch address as
rapidly as possible, decoder 150 decodes only a portion of
the opcode byte received from branch scanner 58. More
particularly, for the x86 instruction set, decoder 150 may
decode the four most significant bits of the opcode byte
identified by the set start and control transfer bits to select
one of the first target address from branch scanner 58, the
return address from return stack 64, and the sequential
address. FIG. 10, described in more detail below, is a truth
table corresponding to one embodiment of decoder 150.
Because only a subset of the bits of the opcode byte are
decoded, fewer logic levels may be employed to generate the
selection controls for fetch address mux 152, thereby allowing
rapid target address selection. If the target address
selected responsive to the decode is incorrect, the fetched
instructions may be discarded and the correct fetch address
may be generated during a subsequent clock cycle.

Because the branch prediction corresponding to the first
branch instruction within the run is not available until late in
the clock cycle in which the fetch address is selected,
decoder 150 does not attempt to select the second branch
target address as the target fetch address. If the first branch
instruction is predicted not taken, via branch history table
60, the second target address corresponding to the second
identified branch instruction (if any) may be fetched in a
subsequent clock cycle if the second branch instruction is
predicted taken by branch history table 60. Also, if the first
branch is predicted taken but the first target address is within
the same run as the first branch, the sequential address is
selected. If the first branch does not branch past the second
branch within the run, the second target address is selected
during the subsequent clock cycle. Similarly, if the first
branch instruction uses an indirect target address or 32-bit
relative target address, fetch address mux 152 may select an
address and the fetched instructions may be discarded in
favor of instructions at the actual branch target.

L1 prefetch control unit 156 generates an L1 prefetch
address for L1 I-cache 14. The cache line corresponding to
the L1 prefetch address is conveyed to L0 I-cache 16 for
storage. L1 prefetch control unit 156 selects the prefetch
address from one of several sources. If a branch misprediction
is signalled by reorder buffer 28, the sequential address
to the corrected fetch address provided by reorder buffer 28
is selected since the other address sources are based upon
instructions within the mispredicted path. If no branch
misprediction is signalled and an L0 fetch address miss is
detected, L1 prefetch control unit 156 selects the L0 fetch
address miss for prefetching. If no miss is detected, L1
prefetch control unit 156 selects either the indirect address
provided by indirect address cache 66 or a 32-bit branch
target address from branch scanner 58 responsive to signals
from decoder 150. If no signals are received from decoder
150, L1 prefetch control unit 156 prefetches the cache line
sequential to the target address selected by fetch address
mux 152.
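
The priority among the prefetch address sources described above may
be summarized by the following sketch. The function and parameter
names are hypothetical, an absent source is represented by None, and
a 64 byte cache line is assumed for computing the sequential line.

```python
from typing import Optional

LINE_BYTES = 64  # assumed line size, matching the 64 byte offset example

def next_line(addr: int) -> int:
    """Address of the cache line sequential to the line containing `addr`."""
    return (addr // LINE_BYTES + 1) * LINE_BYTES

def select_l1_prefetch(corrected_fetch_addr: Optional[int],
                       l0_miss_addr: Optional[int],
                       indirect_or_rel32_addr: Optional[int],
                       target_fetch_addr: int) -> int:
    if corrected_fetch_addr is not None:
        # Misprediction: the other sources derive from the mispredicted path.
        return next_line(corrected_fetch_addr)
    if l0_miss_addr is not None:
        return l0_miss_addr                  # L0 fetch address miss
    if indirect_or_rel32_addr is not None:
        return indirect_or_rel32_addr        # signalled by the decoder
    return next_line(target_fetch_addr)      # default: sequential line
```

Each later source is consulted only when every earlier one is absent,
so a misprediction overrides an L0 miss, which in turn overrides an
indirect or 32-bit target.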

Indirect addresses and 32-bit target addresses are not 
fetched from L0 I-cache 16 because these types of target 
addresses are typically selected by a programmer when the 
target instruction sequence is not spatially located within 
memory near the branch instruction. Because L0 I-cache 16
stores a small number of cache lines most recently accessed 
in response to the code sequence being executed, it may be 
statistically less likely that the target instruction sequence is 
stored in the L0 I-cache 16. 

Incrementor 154 is configured to increment the fetch
address corresponding to the run selected for dispatch based
on the branch prediction information received from branch
history table 60. Prefetch control unit 50 includes logic (not
shown) for selecting the run, via instruction select multiplexor
54, based on L0 I-cache hit information as well as the
branch prediction information. This logic also causes incrementor
154 to increment the fetch address corresponding to
the selected run (either the sequential fetch address provided
from register 158 or the target fetch address provided from
fetch address mux 152). Accordingly, the sequential fetch
address for the subsequent clock cycle is generated and
stored in register 158.

Turning next to FIG. 10, a truth table 160 corresponding
to one embodiment of decoder 150 employed within one 
embodiment of processor 10 employing the x86 instruction 
set is shown. Other embodiments are possible and contem- 
plated. As shown in FIG. 10, opcodes having the four most 
significant bits equal to (in hexadecimal) 7, E, or 0 result in 
the first target address being selected by fetch address mux 
152. Opcodes having the four most significant bits equal to 
C result in the return address from return stack 64 being 
selected, and opcodes having the four most significant bits 
equal to F cause the sequential address to be selected. 

Branch instruction opcodes having the four most signifi- 
cant bits equal to 7 are conditional jump instructions having 
eight bit relative displacements. Accordingly, an opcode 
corresponding to a set start bit and set control transfer bit 
which has the four most significant bits equal to 7 correctly 
selects the target address provided from run target mux 
144A. Branch instruction opcodes having the four most 
significant bits equal to E may be conditional jump instruc- 
tions with eight bit relative displacements, or call or uncon- 
ditional jump instructions having either eight bit relative 
displacements or 32 bit relative displacements. For these 
cases, decoder 150 selects the first target address provided 
by run target mux 144A and, if further decode indicates that 
a 32-bit displacement field is included in the branch 
instruction, the instructions fetched in response to the selection
are discarded and the correct fetch address is prefetched
from L1 I-cache 14 via L1 prefetch control unit 156 receiving
the 32-bit fetch address from branch scanner 58. Finally,
branch instruction opcodes having the four most significant
bits equal to 0 specify 32-bit relative displacements. Since
decoder 150 cannot select the 32-bit target address for
fetching from L0 I-cache 16 in the present embodiment,
decoder 150 selects the first target address provided from
branch scanner 58 and signals L1 prefetch control unit 156
to select the 32-bit branch target address from branch
scanner 58 for prefetching from L1 I-cache 14.

Branch instruction opcodes having the four most signifi- 
cant bits equal to C are return instructions, and hence the 
return address provided by return address stack 64 provides 
the predicted fetch address. On the other hand, branch 
instruction opcodes having the four most significant bits 
equal to F are call or unconditional jump instructions which
use indirect target address generation. The indirect address
is not provided to fetch address mux 152, and hence a default
selection of the sequential address is performed. The instructions
fetched in response to the sequential address are
discarded and instructions prefetched from L1 I-cache 14 are
provided during a subsequent clock cycle. 

As truth table 160 illustrates, predecode of just a portion 
of the instruction byte identified by the start and control 
transfer bits may be used to select a target fetch address for 
L0 I-cache 16. Accordingly, prefetch control unit 50 and 
branch scanner 58 may support high frequency, single cycle 
L0 I-cache access. 
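
The partial decode summarized by truth table 160 may be expressed in
software as follows. This is an illustrative sketch only; the
function name and the string labels for the three selections are
hypothetical, and the error raised for other values is an assumption,
since the description covers only the values 7, E, 0, C, and F.

```python
def select_fetch_source(opcode: int) -> str:
    """Map the four most significant bits of a branch opcode byte to a source."""
    nibble = (opcode >> 4) & 0xF
    if nibble in (0x7, 0xE, 0x0):
        return "first_target"     # target address from the branch scanner
    if nibble == 0xC:
        return "return_stack"     # return instructions
    if nibble == 0xF:
        return "sequential"       # indirect targets default to sequential
    raise ValueError("value not covered by the truth table as described")
```

For example, an opcode byte of 0x74 selects the first target address,
0xC3 selects the return address, and 0xFF selects the sequential
address.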

Turning next to FIG. 10A, a flowchart is shown illustrat- 
ing operation of one embodiment of decoder 150. Other 
embodiments are possible and contemplated. While shown 
as a serial series of steps in FIG. 10A, it is understood that 
the steps illustrated may be performed in any suitable order, 
and may be performed in parallel by combinatorial logic 
employed within decoder 150. 

Decoder 150 determines if a branch misprediction is 
being signalled by reorder buffer 28 (decision block 192). If 
a misprediction is signalled, the corrected fetch address 
received from reorder buffer 28 is selected (step 193). On the
other hand, if a misprediction is not signalled, decoder 150
determines if the second target address corresponding to the
second branch instruction identified during the previous
clock cycle by branch scanner 58 is to be fetched (decision
block 194). The second target address may be fetched if the
first branch instruction was predicted not-taken and the
second branch instruction was predicted taken. Additionally,
the second target address may be fetched if the first branch
instruction was predicted taken, but was a small forward
displacement which does not cancel the second branch
instruction, and the second branch instruction was predicted
taken. If the second target address is to be fetched, decoder
150 selects the second target address (which was received in
the previous clock cycle and is one clock cycle delayed in
reaching fetch address mux 152; step 195). Finally, if the
second target address is not to be fetched, decoder 150
operates according to truth table 160 described above (step
[...]).

Turning now to FIG. 11, a flowchart is shown illustrating
operation of one embodiment of L1 prefetch control unit
156. Other embodiments are possible and contemplated.
While shown as a serial series of steps in FIG. 11, it is
understood that the steps illustrated may be performed in
any suitable order, and may be performed in parallel by
combinatorial logic employed within L1 prefetch control
unit 156.

If a branch misprediction redirection is received by L1
prefetch control unit 156 (decision block 170), the sequential
cache line to the cache line corresponding to the corrected
fetch address is prefetched from L1 I-cache 14 (step
172). On the other hand, if a branch misprediction redirection
is not received, L1 prefetch control unit 156 determines
if an L0 I-cache miss has occurred (decision block 174). If
an L0 I-cache miss is detected, the address missing L0
I-cache 16 is prefetched from L1 I-cache 14 (step 176). In
the absence of an L0 I-cache miss, L1 prefetch control unit
156 determines if either an indirect target address or a 32-bit
relative target address has been detected by decoder 150
(decision block 178). If such a signal is received, the indirect
address received from indirect address cache 66 or the 32-bit
relative target address received from branch scanner 58 is
prefetched from L1 I-cache 14 (step 180). Finally, if no
indirect target address or 32-bit relative target address is
signalled, L1 prefetch control unit 156 prefetches the next
sequential cache line to the current target fetch address (step
182).

Turning now to FIG. 12, a table 190 is shown illustrating
the fetch results corresponding to one embodiment of processor
10 for various target addresses and branch predictions
corresponding to the first and second branch instructions
identified within an instruction run. Other embodiments are
possible and contemplated. As used in table 190, a small forward
target is a target which lies within the current run.
Conversely, a large forward target is a target which does not
lie within the current run. A target is forward if the target
address is numerically greater than the address of the branch
instruction, and backward if the target address is numerically
smaller than the address of the branch instruction. The
taken/not taken prediction is derived from branch history
table 60. As illustrated by the footnote, results corresponding
to the second branch prediction may be delayed by a clock
cycle according to one embodiment. Therefore, processor 10
may assume not taken for the second branch prediction (i.e.
fetch the sequential address) and, if the second branch
prediction indicates taken, the fetch may be corrected during
the subsequent clock cycle.

The result column in table 190 lists several results. The
term "squash" when used in the result column of table 190
indicates which instructions are deleted from instruction
queue 20 via signals from forward collapse unit 68 shown in
FIG. 2. Additionally, the target or sequential address to be
fetched responsive to the first and/or second branch instructions
is indicated followed by parenthetical notation as to
which of L0 I-cache 16 (L0 notation) or L1 I-cache 14 (L1
notation) the target or sequential address is conveyed.

Turning next to FIG. 13, a block diagram of one exemplary
embodiment of instruction queue 20 is shown. Other
embodiments are possible and contemplated. In the embodiment
of FIG. 13, instruction queue 20 includes run storages
300A-300B, scan data storages 302A-302B, and address
storages 304A-304B. Additionally, instruction queue 20
includes a mux 306 and a control unit 308. A run of
instructions is provided to instruction queue 20 from fetch/
scan unit 18 via a run bus 310; corresponding scan data is
provided on a scan data bus 312; and corresponding
addresses (one per run section) are provided on a run
addresses bus 314. Instruction queue 20 provides a set of
selected instruction bytes to alignment unit 22 on instruction
bytes bus 316, pointers to instructions within the instruction
bytes on an instruction pointers bus 318, and addresses for
the run sections comprising the set of selected instruction
bytes on an addresses bus 320. Run bus 310 is coupled to run
storages 300A-300B, while scan data bus 312 is coupled to
scan data storages 302A-302B and address storages
304A-304B are coupled to run addresses bus 314. Storages
300A-300B, 302A-302B, and 304A-304B are coupled to
mux 306, which is further coupled to buses 316-320.
Control unit 308 is coupled to mux 306 and scan data
storages 302A-302B.

Fetch/scan unit 18, and more particularly instruction
scanner 56 according to the embodiment of FIG. 2, provides
a run of instructions and associated information to instruction
queue 20 via buses 310-314. Control unit 308 allocates
one of run storages 300A-300B for the instruction bytes
comprising the instruction run, and a corresponding scan
data storage 302A-302B and address storage 304A-304B
for the associated information. The scan data includes
instruction pointers which identify: (i) the start byte and end
byte as offsets within a run section; as well as (ii) the run
section within which the instruction resides. According to
one particular embodiment, up to five instructions may be
identified within an eight byte run section, and there are up
to three run sections in a run for a total of up to 15
instruction pointers stored within a scan data storage 302.
Additionally, address storages 304 store an address corresponding
to each run section.

Control unit 308 examines the instruction pointers within
scan data storages 302A-302B to identify instructions
within a set of contiguous run sections for dispatch to
alignment unit 22. In one particular embodiment, up to six
instructions are identified within up to four contiguous run
sections. The run sections may be stored in one of run
storages 300A or 300B, or some run sections may be
selected from one of run storages 300A-300B and the other
run sections may be selected from the other one of run
storages 300A-300B. A first run section is contiguous to a
second run section if the first run section is next, in speculative
program order, to the second run section. It is noted
that mux 306, while illustrated as a single mux in FIG. 13 for
simplicity in the drawing, may be implemented by any
suitable parallel or cascaded set of multiplexors.

Control unit 308 provides a set of selection signals to mux
306 to select the set of run sections including the selected
instructions, as well as the instruction pointers corresponding
to the selected instructions. Additionally, the address for
each selected run section is selected. The run sections are
provided upon instruction bytes bus 316, while the corre- 
sponding instruction pointers and addresses are provided 
upon instruction pointers bus 318 and addresses bus 320, 
respectively. 
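
The selection limits described for control unit 308 (up to six
instructions within up to four contiguous run sections, in one
particular embodiment) may be illustrated by the following sketch.
The tuple layout and the function name are assumptions made for the
example. The pointers are taken to be already ordered in speculative
program order, and instructions spanning two run sections are not
modeled.

```python
from typing import List, Tuple

MAX_INSTRUCTIONS = 6   # per dispatch, in one particular embodiment
MAX_SECTIONS = 4       # contiguous run sections per dispatch

Pointer = Tuple[int, int, int]   # (run section index, start offset, end offset)

def select_for_dispatch(pointers: List[Pointer]) -> List[Pointer]:
    """Pick leading instruction pointers subject to both limits."""
    selected: List[Pointer] = []
    sections: List[int] = []
    for ptr in pointers:
        section = ptr[0]
        if section not in sections:
            if len(sections) == MAX_SECTIONS:
                break                      # a fifth run section would be needed
            sections.append(section)
        selected.append(ptr)
        if len(selected) == MAX_INSTRUCTIONS:
            break                          # instruction limit reached
    return selected
```

Whichever limit is reached first ends the selection, so a sparse run
may be cut off by the section limit before six instructions are
found.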

Turning next to FIG. 14, a block diagram of one embodi- 
ment of future file 26 and reorder buffer/register file 28 is 
shown in more detail. Other embodiments are possible and 
contemplated. In the embodiment of FIG. 14, future file 26 
is shown along with a register file 28A and a reorder buffer 
28B. Future file 26 is coupled to register file 28A, result 
buses 48, a set of source operand address buses 330, a set of 
source operand buses 332, and a set of lookahead update 
buses 334. Reorder buffer 28B is coupled to register file 
28A, result buses 48, and dispatched instructions buses 336.

As instructions are decoded by decode units 70 within 
lookahead/collapse unit 24, the register source operands of
the instructions are routed to future file 26 via source 
operand address buses 330. Future file 26 provides either the 
most current speculative value of each register, if the instruc- 
tion generating the most current value has executed, or a 
reorder buffer tag identifying the instruction which will 
generate the most current value, upon source operands buses 
332. Additionally, one of the source operands may be 
indicated to be a destination operand. Future file 26 updates 
the location corresponding to the destination register with 
the reorder buffer tag to be assigned to the corresponding 
instruction in response to the destination operand. 

Future file 26 additionally receives updates from 
lookahead/collapse unit 24. Lookahead results generated by
lookahead address/result calculation unit 74 are provided to 
future file 26 via lookahead update buses 334. By providing 
lookahead updates from lookahead address/result calcula- 
tion unit 74, speculative execution results may be stored into 
future file 26 more rapidly and may thereby be available 
more rapidly to subsequently executing instructions. Sub- 
sequent instructions may thereby be more likely to achieve 
lookahead result calculation. In one embodiment, to reduce
the number of ports on future file 26, the number of
lookahead updates is limited (for example, 2 updates may be
allowable). Since the ESP updates are already captured by
lookahead/collapse unit 24, those updates need not be stored
into future file 26. Furthermore, not every issue position will 
have a speculative update for future file 26. Accordingly, 
fewer speculative updates, on average, may be needed in 
future file 26 and therefore limiting the number of updates 
may not reduce performance. 

Instruction results are provided upon result buses 48. 
Future file 26 receives the results and compares the corre- 
sponding reorder buffer tags (also provided upon result 
buses 48) to the reorder buffer tags stored therein to deter- 
mine whether or not the instruction result comprises the 
most recent speculative update to one of the architected 
registers. If the reorder buffer tag matches one of the reorder 
buffer tags in the future file, the result is captured by future
file 26 and associated with the corresponding architected 
register. 

Future file 26 is coupled to register file 28A to receive a 
copy of the architected registers stored therein when an 
exception/branch misprediction is detected and retired. 
Reorder buffer 28B may detect exceptions and branch 
mispredictions from the results provided upon result buses 
48, and may signal register file 28A and future file 26 if a
copy of the architected registers as retired in register file 28A 
is to be copied to future file 26. For example, upon retiring 
an instruction having an exception or branch misprediction, 
the copy may be performed. In this manner, future file 26
may be recovered from incorrect speculative execution. 
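
The behavior of future file 26 described above may be illustrated with a small software sketch. The sketch is illustrative only and is not the disclosed hardware: the class name, the method names, and the register names in the usage note are assumptions introduced here. Each entry holds either the most current speculative value or the reorder buffer tag of the instruction that will produce it; a result is captured only when its tag still matches the stored tag, and recovery copies the retired state from the register file.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    value: int = 0             # most current speculative value (valid when tag is None)
    tag: Optional[int] = None  # reorder buffer tag of the pending producer, if any


class FutureFile:
    """Hypothetical software model of a future file; names are illustrative."""

    def __init__(self, register_names):
        self.entries: Dict[str, Entry] = {r: Entry() for r in register_names}

    def read_source(self, reg: str) -> Tuple[str, int]:
        """Return ('value', v) if the producer has executed, else ('tag', t)."""
        e = self.entries[reg]
        return ("tag", e.tag) if e.tag is not None else ("value", e.value)

    def allocate_destination(self, reg: str, rob_tag: int) -> None:
        """Record the tag of the newest instruction that writes this register."""
        self.entries[reg].tag = rob_tag

    def capture_result(self, rob_tag: int, value: int) -> bool:
        """Capture a result only if its tag is still the most recent for a register."""
        for e in self.entries.values():
            if e.tag == rob_tag:
                e.value, e.tag = value, None
                return True
        return False  # a newer instruction has since claimed the register

    def recover(self, architected: Dict[str, int]) -> None:
        """Copy the retired register state after an exception or misprediction."""
        for reg, value in architected.items():
            self.entries[reg] = Entry(value=value)
```

For example, if two instructions with tags 3 and 7 both write the same register, only the result tagged 7 is captured, because tag 3 no longer identifies the most recent speculative update.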

Reorder buffer 28B receives the dispatched instructions
from lookahead/collapse unit 24 via dispatched instructions
bus 336. The dispatched instructions may be provided to
reorder buffer 28B upon a determination by dispatch control
unit 76 that instructions are to be dispatched. Additionally,
reorder buffer 28B receives execution results upon result
buses 48 and retires the results, in program order, to register
file 28A.

Turning now to FIG. 15, a block diagram of one embodi- 
ment of a computer system 200 including processor 10 
coupled to a variety of system components through a bus 
bridge 202 is shown. Other embodiments are possible and
contemplated. In the depicted system, a main memory 204
is coupled to bus bridge 202 through a memory bus 206, and
a graphics controller 208 is coupled to bus bridge 202
through an AGP bus 210. Finally, a plurality of PCI devices
212A-212B are coupled to bus bridge 202 through a PCI bus
214. A secondary bus bridge 216 may further be provided to
accommodate an electrical interface to one or more EISA or
ISA devices 218 through an EISA/ISA bus 220. Processor 10
is coupled to bus bridge 202 through bus interface 46.

Bus bridge 202 provides an interface between processor 
10, main memory 204, graphics controller 208, and devices 
attached to PCI bus 214. When an operation is received from 
one of the devices connected to bus bridge 202, bus bridge 
202 identifies the target of the operation (e.g. a particular 
device or, in the case of PCI bus 214, that the target is on PCI 
bus 214). Bus bridge 202 routes the operation to the targeted 
device. Bus bridge 202 generally translates an operation 
from the protocol used by the source device or bus to the 
protocol used by the target device or bus. 

In addition to providing an interface to an ISA/EISA bus
for PCI bus 214, secondary bus bridge 216 may further
incorporate additional functionality, as desired. For
example, in one embodiment, secondary bus bridge 216
includes a master PCI arbiter (not shown) for arbitrating
ownership of PCI bus 214. An input/output controller (not
shown), either external from or integrated with secondary
bus bridge 216, may also be included within computer
system 200 to provide operational support for a keyboard
and mouse 222 and for various serial and parallel ports, as
desired. An external cache unit (not shown) may further be
coupled to bus interface 46 between processor 10 and bus
bridge 202 in other embodiments. Alternatively, the external
cache may be coupled to bus bridge 202 and cache control
logic for the external cache may be integrated into bus
bridge 202.

Main memory 204 is a memory in which application
programs are stored and from which processor 10 primarily
executes. A suitable main memory 204 comprises DRAM
(Dynamic Random Access Memory), and preferably a
plurality of banks of SDRAM (Synchronous DRAM).

PCI devices 212A-212B are illustrative of a variety of
peripheral devices such as, for example, network interface
cards, video accelerators, audio cards, hard or floppy disk
drives or drive controllers, SCSI (Small Computer Systems
Interface) adapters and telephony cards. Similarly, ISA
device 218 is illustrative of various types of peripheral
devices, such as a modem, a sound card, and a variety of data
acquisition cards such as GPIB or field bus interface cards.

Graphics controller 208 is provided to control the rendering
of text and images on a display 226. Graphics controller
208 may embody a typical graphics accelerator generally
known in the art to render three-dimensional data structures
which can be effectively shifted into and from main memory
204. Graphics controller 208 may therefore be a master of
AGP bus 210 in that it can request and receive access to a
target interface within bus bridge 202 to thereby obtain
access to main memory 204. A dedicated graphics bus
accommodates rapid retrieval of data from main memory
204. For certain operations, graphics controller 208 may
further be configured to generate PCI protocol transactions
on AGP bus 210. The AGP interface of bus bridge 202 may
thus include functionality to support both AGP protocol
transactions as well as PCI protocol target and initiator
transactions. Display 226 is any electronic display upon
which an image or text can be presented. A suitable display
226 includes a cathode ray tube ("CRT"), a liquid crystal
display ("LCD"), etc.

It is noted that, while the AGP, PCI, and ISA or EISA 
buses have been used as examples in the above description, 
any bus architectures may be substituted as desired. It is 
further noted that computer system 200 may be a multipro-
cessing computer system including additional processors
(e.g. processor 10a shown as an optional component of
computer system 200). Processor 10a may be similar to
processor 10. More particularly, processor 10a may be an
identical copy of processor 10. Processor 10a may share bus
interface 46 with processor 10 (as shown in FIG. 15) or may
be connected to bus bridge 202 via an independent bus.

Numerous variations and modifications will become 
apparent to those skilled in the art once the above disclosure 
is fully appreciated. It is intended that the following claims 
be interpreted to embrace all such variations and modifications.

What is claimed is: 

1. A processor comprising: 

a predecode unit configured to predecode a plurality of
instruction bytes received by said processor, wherein
said predecode unit, upon predecoding a relative con-
trol transfer instruction comprising a displacement, is
configured to add an address to said displacement to
generate a target address corresponding to said relative
control transfer instruction, and wherein said predecode
unit is configured to replace said displacement within
said relative control transfer instruction with an
encoded value indicative of said target address, and
wherein a control transfer instruction, when executed,
specifies an address from which a subsequent instruc-
tion to be executed is fetched, and wherein said pre-
decode unit is configured to generate a plurality of
control transfer indications, and wherein each one of
said plurality of control transfer indications corre-
sponds to a different one of said plurality of instruction
bytes, and wherein said plurality of control transfer
indications identify control transfer instructions includ-
ing said relative control transfer instruction; and

an instruction cache coupled to said predecode unit,
wherein said instruction cache is configured to store
said plurality of instruction bytes including said relative
control transfer instruction with said encoded value in
place of said displacement, and wherein said instruc-
tion cache is configured to store said plurality of control
transfer indications.

2. The processor as recited in claim 1 wherein said 
displacement includes fewer bits than said target address. 

3. The processor as recited in claim 2 wherein said 
encoded value includes a first field and a second field. 

4. The processor as recited in claim 3 wherein said first
field comprises an offset within a target cache line of a byte
identified by said target address.

5. A method for generating a target address for a relative
control transfer instruction, the method comprising: 

predecoding a plurality of instruction bytes including said 
relative transfer instruction to detect a presence of said 
relative control transfer instruction, wherein a control 
transfer instruction, when executed, specifies an 
address from which a subsequent instruction to be 
executed is fetched, and wherein said predecoding 
comprises generating a plurality of control transfer 
indications, wherein each of said plurality of control 
transfer indications corresponds to a different one of 
said plurality of instruction bytes, and wherein said 
plurality of control transfer indications identify control 
transfer instructions starting at said different ones of 
said plurality of instruction bytes; 

adding an address to a displacement included in said 
relative control transfer instruction, thereby generating 
said target address; 

replacing said displacement within said relative control 
transfer instruction with an encoding indicative of said 
target address; and 

storing said plurality of instruction bytes including said 
relative control transfer instruction, with said displace- 
ment replaced by said encoding, and said plurality of 
control transfer indications in an instruction cache. 

6. A predecode unit comprising: 

a decoder configured to decode a plurality of instruction 
bytes and to identify a relative control transfer instruc- 
tion therein, wherein a control transfer instruction, 
when executed, specifies an address from which a 
subsequent instruction to be executed is fetched, and 
wherein said decoder is configured to generate a plu- 
rality of control transfer indications, and wherein each 
one of said plurality of control transfer indications 
corresponds to a different one of said plurality of 
instruction bytes, and wherein said plurality of control 
transfer indications identify control transfer instruc- 
tions including said relative control transfer instruction; 
and 

a target generator configured to add a displacement 
selected from said relative control transfer instruction 
to an address, thereby generating a target address 
corresponding to said relative control transfer 
instruction, and further configured to generate an 
encoding of said target address with which said prede- 
code unit replaces said displacement within said rela- 
tive control transfer instruction. 

7. The predecode unit as recited in claim 6 wherein said 
target generator includes a sign extend block configured to 
sign extend said displacement to a number of bits in said 
address. 

8. The predecode unit as recited in claim 7 wherein said 
target generator further includes: 

an adder coupled to said sign extend block and to receive
said address, wherein said adder is configured to add 
said sign extended displacement and said address to 
generate said target address; and 

a displacement encoder coupled to said adder, wherein 
said displacement encoder is configured to encode said 
target address. 

9. The predecode unit as recited in claim 8 wherein said 
displacement encoder is configured to encode said target 
address as: (i) an offset within a target cache line of a byte 
identified by said target address; and (ii) a number of cache 
lines above or below a cache line storing said relative control 
transfer instruction at which said target cache line is stored. 

10. A computer system comprising:

a processor configured to predecode a plurality of instruc-
tion bytes received by said processor, wherein said
processor, upon predecoding a relative control transfer
instruction comprising a displacement, is configured to
add an address to said displacement to generate a target
address corresponding to said relative control transfer
instruction, and wherein said processor is configured to
replace said displacement within said relative control
transfer instruction with an encoded value indicative of
said target address, and wherein a control transfer
instruction, when executed, specifies an address from
which a subsequent instruction to be executed is
fetched, and wherein said processor, during
predecoding, is configured to generate a plurality of
control transfer indications, and wherein each one of
said plurality of control transfer indications corre-
sponds to a different one of said plurality of instruction
bytes, and wherein said plurality of control transfer
indications identify control transfer instructions includ-
ing said relative control transfer instruction;

a memory coupled to said processor, wherein said
memory is configured to store said plurality of instruc-
tion bytes and to provide said instruction bytes to said
processor; and

a peripheral device configured to transfer data between
said computer system and another computer system.

11. The computer system as recited in claim 10 wherein 
said peripheral device is a modem. 

12. The computer system as recited in claim 10 further 
comprising an audio peripheral device. 

13. The computer system as recited in claim 12 wherein 
said audio I/O device comprises a sound card. 

14. The computer system as recited in claim 10 further 
comprising a second processor configured to predecode a 
plurality of instruction bytes received by said second 
processor, wherein said second processor, upon predecoding 
a relative control transfer instruction comprising a 
displacement, is configured to add an address to said dis- 
placement to generate a target address corresponding to said 
relative control transfer instruction, and wherein said second 
processor is configured to replace said displacement within 
said relative control transfer instruction with an encoded 
value indicative of said target address. 

* * * * *
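
The target generation recited in claims 7 through 9 (sign extending the displacement, adding it to an address, and encoding the target as an offset within the target cache line plus a number of cache lines above or below the line storing the instruction) may be illustrated with the following sketch. It is an illustration under stated assumptions rather than the claimed circuit: the 64-byte cache line size, the 32-bit address width, and all function and parameter names are hypothetical and do not appear in the text.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size; not specified in the text
ADDRESS_BITS = 32      # assumed address width; not specified in the text


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def encode_target(base_address: int, displacement: int,
                  displacement_bits: int, instruction_address: int):
    """Return (target, offset_in_line, line_delta).

    base_address is the address to which the displacement is added;
    instruction_address locates the cache line storing the instruction.
    """
    target = (base_address + sign_extend(displacement, displacement_bits)) % (1 << ADDRESS_BITS)
    offset_in_line = target % CACHE_LINE_BYTES
    # Positive line_delta: target line is above the instruction's line; negative: below.
    line_delta = target // CACHE_LINE_BYTES - instruction_address // CACHE_LINE_BYTES
    return target, offset_in_line, line_delta
```

With these assumed parameters, an 8-bit displacement of 0x7F added to address 0x1002 yields target 0x1081, which lies at offset 1 in a cache line two lines above the line containing address 0x1000.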
ABSTRACT

A graphics system that is configured to utilize a sample
buffer and a plurality of parallel sample-to-pixel calculation
units, wherein the sample-to-pixel calculation units are
configured to access different portions of the sample buffer in
parallel. The graphics system may include a graphics
processor, a sample buffer, and a plurality of sample-to-pixel
calculation units. The graphics processor is configured to
receive a set of three-dimensional graphics data and render
a plurality of samples based on the graphics data. The
sample buffer is configured to store the plurality of samples
for the sample-to-pixel calculation units, which are config-
ured to receive and filter samples from the sample buffer to
create output pixels. Each of the sample-to-pixel calculation
units is configured to generate pixels corresponding to a
different region of the image. The region may be a vertical
or horizontal stripe of the image, or a rectangular portion of 
the image. Each region may overlap the other regions of the 
image to prevent visual aberrations. 

52 Claims, 25 Drawing Sheets 
GRAPHICS SYSTEM CONFIGURED TO
PERFORM PARALLEL SAMPLE TO PIXEL 
CALCULATION 

This application is a continuation-in-part of co-pending 
application Ser. No. 09/251,844 titled "Graphics System 
With Programmable Real-Time Alpha Key Generation", 
filed on Feb. 17, 1999, which claims the benefit of U.S. 
Provisional Application No. 60/074,836, filed Feb. 17, 1998. 
These applications are hereby incorporated by reference in 
their entirety. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates generally to the field of computer 
graphics and, more particularly, to high performance graph- 
ics systems. 

2. Description of the Related Art 

A computer system typically relies upon its graphics 
system for producing visual output on the computer screen 
or display device. Early graphics systems were only respon- 
sible for taking what the processor produced as output and 
displaying it on the screen. In essence, they acted as simple 
translators or interfaces. Modern graphics systems, however,
incorporate graphics processors with a great deal of pro- 
cessing power. They now act more like coprocessors rather 
than simple translators. This change is due to the recent 
increase in both the complexity and amount of data being 
sent to the display device. For example, modern computer
displays have many more pixels, greater color depth, and are 
able to display more complex images with higher refresh 
rates than earlier models. Similarly, the images displayed are 
now more complex and may involve advanced techniques 
such as anti-aliasing and texture mapping. 

As a result, without considerable processing power in the 
graphics system, the CPU would spend a great deal of time 
performing graphics calculations. This could rob the com- 
puter system of the processing power needed for performing 
other tasks associated with program execution and thereby 
dramatically reduce overall system performance. With a 
powerful graphics system, however, when the CPU is 
instructed to draw a box on the screen, the CPU is freed from 
having to compute the position and color of each pixel. 
Instead, the CPU may send a request to the video card stating
"draw a box at these coordinates." The graphics system then
draws the box, freeing the processor to perform other tasks. 

Generally, a graphics system in a computer (also referred 
to as a graphics system) is a type of video adapter that 
contains its own processor to boost performance levels. 
These processors are specialized for computing graphical 
transformations, so they tend to achieve better results than 
the general-purpose CPU used by the computer system. In 
addition, they free up the computer's CPU to execute other 
commands while the graphics system is handling graphics 
computations. The popularity of graphical applications, and 
especially multimedia applications, has made high perfor- 
mance graphics systems a common feature of computer 
systems. Most computer manufacturers now bundle a high 
performance graphics system with their systems. 

Since graphics systems typically perform only a limited 
set of functions, they may be customized and therefore far 
more efficient at graphics operations than the computer's 
general-purpose central processor. While early graphics sys- 
tems were limited to performing two-dimensional (2D) 
graphics, their functionality has increased to support three- 
dimensional (3D) wire-frame graphics, 3D solids, and now 
includes support for three-dimensional (3D) graphics with
textures and special effects such as advanced shading, 
fogging, alpha-blending, and specular highlighting. 

The processing power of 3D graphics systems has been
improving at a breakneck pace. A few years ago, shaded
images of simple objects could only be rendered at a few
frames per second, while today's systems support rendering
of complex objects at 60 Hz or higher. At this rate of
increase, in the not too distant future, graphics systems will
literally be able to render more pixels than a single human's
visual system can perceive. While this extra performance
may be useable in multiple-viewer environments, it may be
wasted in more common primarily single-viewer environ-
ments. Thus, a graphics system is desired which is capable
of matching the variable nature of the human resolution
system (i.e., capable of putting the quality where it is needed
or most perceivable).

While the number of pixels is an important factor in
determining graphics system performance, another factor of
equal import is the quality of the image. For example, an
image with a high pixel density may still appear unrealistic
if edges within the image are too sharp or jagged (also
referred to as "aliased"). One well-known technique to
overcome these problems is anti-aliasing. Anti-aliasing
involves smoothing the edges of objects by shading pixels
along the borders of graphical elements. More specifically,
anti-aliasing entails removing higher frequency components
from an image before they cause disturbing visual artifacts.
For example, anti-aliasing may soften or smooth high con-
trast edges in an image by forcing certain pixels to inter-
mediate values (e.g., around the silhouette of a bright object
superimposed against a dark background).

Another visual effect used to increase the realism of
computer images is alpha blending. Alpha blending is a
technique that controls the transparency of an object, allow-
ing realistic rendering of translucent surfaces such as water
or glass. Another effect used to improve realism is fogging.
Fogging obscures an object as it moves away from the
viewer. Simple fogging is a special case of alpha blending in
which the degree of alpha changes with distance so that the
object appears to vanish into a haze as the object moves
away from the viewer. This simple fogging may also be
referred to as "depth cueing" or atmospheric attenuation,
i.e., lowering the contrast of an object so that it appears less
prominent as it recedes. More complex types of fogging go
beyond a simple linear function to provide more complex
relationships between the level of translucence and an
object's distance from the viewer. Current state of the art
software systems go even further by utilizing atmospheric
models to provide low-lying fog with improved realism.
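
Simple fogging as a special case of alpha blending may be illustrated with the following sketch. It is a minimal example under stated assumptions, not a description of any particular prior art system: the function names, the `fog_start` and `fog_end` distances, and the choice of a clamped linear function are introduced here for illustration only.

```python
def alpha_blend(src, dst, alpha):
    """Blend color `src` over color `dst`; alpha=1.0 shows only src, 0.0 only dst."""
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))


def linear_fog_alpha(distance, fog_start, fog_end):
    """Alpha falls linearly from 1.0 at fog_start to 0.0 at fog_end (clamped)."""
    if fog_end <= fog_start:
        raise ValueError("fog_end must be greater than fog_start")
    t = (fog_end - distance) / (fog_end - fog_start)
    return min(1.0, max(0.0, t))


def apply_fog(color, fog_color, distance, fog_start, fog_end):
    """Depth cueing: the object fades toward the fog color as distance grows."""
    return alpha_blend(color, fog_color, linear_fog_alpha(distance, fog_start, fog_end))
```

In this sketch an object halfway between the two assumed distances is blended equally with the fog color, and an object at or beyond `fog_end` takes on the fog color entirely. A more complex fogging model would replace `linear_fog_alpha` with a non-linear function of distance.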

While the techniques listed above may dramatically
improve the appearance of computer graphics images, they
also have certain limitations. In particular, they may intro-
duce their own aberrations and are typically limited by the
density of pixels displayed on the display device.

As a result, a graphics system is desired which is capable
of utilizing increased performance levels to increase not
only the number of pixels rendered but also the quality of the
image rendered. In addition, a graphics system is desired
which is capable of utilizing increases in processing power
to improve the results of graphics effects such as anti-
aliasing.

Prior art graphics systems have generally fallen short of
these goals. Prior art graphics systems use a conventional
frame buffer for refreshing pixel/video data on the display.
The frame buffer stores rows and columns of pixels that
exactly correspond to respective row and column locations
on the display. Prior art graphics systems render 2D and/or
3D images or objects into the frame buffer in pixel form, and
then read the pixels from the frame buffer during a screen
refresh to refresh the display. Thus, the frame buffer stores
the output pixels that are provided to the display. To reduce
visual artifacts that may be created by refreshing the screen
at the same time the frame buffer is being updated, most
graphics systems' frame buffers are double-buffered.

To obtain more realistic images, some prior art graphics 
systems have gone further by generating more than one 
sample per pixel. As used herein, the term "sample" refers 
to calculated color information that indicates the color, depth 
(z), transparency, and potentially other information, of a 
particular point on an object or image. For example, a sample
may comprise the following component values: a red value, 
a green value, a blue value, a z value, and an alpha value 
(e.g., representing the transparency of the sample). A sample 
may also comprise other information, e.g., a z-depth value, 
a blur value, an intensity value, brighter-than-bright 
information, and an indicator that the sample consists par- 
tially or completely of control information rather than color 
information (i.e., "sample control information"). By calcu-
lating more samples than pixels (i.e., super-sampling), a
more detailed image is calculated than can be displayed on 
the display device. For example, a graphics system may 
calculate four samples for each pixel to be output to the 
display device. After the samples are calculated, they are 
then combined or filtered to form the pixels that are stored 
in the frame buffer and then conveyed to the display device. 
Using pixels formed in this manner may create a more 
realistic final image because overly abrupt changes in the 
image may be smoothed by the filtering process. 
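
The four-samples-per-pixel example above may be illustrated with a simple box filter that averages each 2x2 block of samples into one pixel. This is a sketch under stated assumptions: the function name, the regular grid layout of the samples, and the use of an unweighted average are illustrative choices and are not attributed to any particular prior art system.

```python
def box_filter(samples, width, height, factor=2):
    """Average each factor x factor block of (r, g, b) samples into one pixel.

    `samples` is a row-major grid with height*factor rows and width*factor
    columns; the result is a grid of `height` rows and `width` columns.
    """
    pixels = []
    for py in range(height):
        row = []
        for px in range(width):
            block = [samples[py * factor + dy][px * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            # Average each color component over the samples in the block.
            row.append(tuple(sum(s[i] for s in block) / len(block) for i in range(3)))
        pixels.append(row)
    return pixels
```

A block containing two black samples and two white samples produces a mid-grey pixel, which is how an abrupt edge crossing the pixel is smoothed.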

These prior art super-sampling systems typically generate 
a number of samples that are far greater than the number of 
pixel locations on the display. These prior art systems 
typically have rendering processors that calculate the 
samples and store them into a render buffer. Filtering hard- 
ware then reads the samples from the render buffer, filters 
the samples to create pixels, and then stores the pixels in a 
traditional frame buffer. The traditional frame buffer is 
typically double-buffered, with one side being used for 
refreshing the display device while the other side is updated 
by the filtering hardware. Once the samples have been 
filtered, the resulting pixels are stored in a traditional frame 
buffer that is used to refresh the display device. These
systems, however, have generally suffered from limitations 
imposed by the conventional frame buffer and by the added 
latency caused by the render buffer and filtering. Therefore, 
an improved graphics system is desired which includes the 
benefits of pixel super-sampling while avoiding the drawbacks
of the conventional frame buffer.

U.S. patent application Ser. No. 09/251,844 titled "Graph- 
ics System with a Variable Resolution Sample Buffer" 
discloses a computer graphics system that utilizes a super- 
sampled sample buffer and a sample-to-pixel calculation 
unit for refreshing the display. The graphics processor 
generates a plurality of samples and stores them into a 
sample buffer. The graphics processor preferably generates 
and stores more than one sample for at least a subset of the 
pixel locations on the display. Thus, the sample buffer is a 
super-sampled sample buffer which stores a number of 
samples that may be far greater than the number of pixel 
locations on the display. The sample-to-pixel calculation 
unit is configured to read the samples from the super- 
sampled sample buffer and filter or convolve the samples 
into respective output pixels, wherein the output pixels are 
then provided to refresh the display. The sample-to-pixel
calculation unit selects one or more samples and filters them
to generate an output pixel. The sample-to-pixel calculation
unit may operate to obtain samples and generate pixels
which are provided directly to the display with no frame
buffer therebetween.

SUMMARY OF THE INVENTION 

The problems set forth above may at least in part be
solved by a graphics system that is configured to utilize a
sample buffer and a plurality of parallel sample-to-pixel
calculation units, wherein the sample-to-pixel calculation units
are configured to access different portions of the sample
buffer in parallel. Advantageously, this configuration
(depending upon the embodiment) may also allow the
graphics system to use a sample buffer in lieu of a traditional
frame buffer that stores pixels. Since the sample-to-pixel
calculation units may be configured to operate in parallel,
the latency of the graphics system may be reduced in some
embodiments.

20 In one embodiment, the graphics system may include one 
or more graphics processors, a sample buffer, and a plurality 
of sample-to-pixel calculation units. The graphics proces- 
sors may be configured to receive a set of three-dimensional 
graphics data and render a plurality of samples based on the 

25 graphics data. The sample buffer may be configured to store 
the plurality of samples (e.g., in a double-buffered 
configuration) for the sample-to-pixel calculation units, 
which are configured to receive and filter samples from the 
sample buffer to create output pixels. The output pixels are 

30 usable to form an image on a display device. Each of the 
sample-to-pixel calculation units are configured to generate
pixels corresponding to a different region of the image. The
region may be a vertical stripe (i.e., a column) of the image,
a horizontal stripe (i.e., a row) of the image, or a rectangular
portion of the image. Note, as used herein the terms "horizontal
row" and "horizontal stripe" are used interchangeably, as are
"vertical column" and "vertical stripe". Each region may overlap
the other regions of the image to prevent visual aberrations
(e.g., seams, lines, or tears in the image). As previously noted,
each of the sample-to-pixel calculation units may advantageously
be configured to operate in parallel on its own region or regions.
The sample-to-pixel calculation units are configured to process
the samples by (i) determining which samples are within a
predetermined filter envelope, (ii) multiplying those samples
by a weighting, (iii) summing the resulting values, and (iv)
normalizing the results to form output pixels. The weighting
value may vary with respect to the sample's position within the
filter envelope (e.g., the weighting factor may decrease as
the samples move farther from the center of the filter
envelope). In some embodiments, the weighting factor may
be normalized or pre-normalized, in which case the resulting
output pixel will not proceed through normalization because
the output will already be normalized. Normalized weighting
factors are adjusted to ensure that pixels generated with
fewer contributing samples will not overpower pixels generated
with more contributing samples. In contrast, if
un-normalized weighting factors are used, the resulting pixel
will typically proceed through normalization. Normalization
will typically be performed in embodiments of the graphics
system that allow for a variable number of samples to
contribute to each output pixel. Normalization may also be
performed in systems that allow variable sample patterns,
and in systems in which the pitch of the centers of filters varies
widely with respect to the sample pattern.
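
The four filtering steps listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the disclosed hardware: the function name `filter_pixel`, the circular envelope, and the linearly decreasing (cone) weighting are all assumptions chosen for brevity.

```python
import math

def filter_pixel(samples, center, radius):
    """Filter (x, y, value) samples around `center` into one pixel value.

    Hypothetical helper: a circular envelope and a cone-shaped weight
    are assumed here purely for illustration.
    """
    cx, cy = center
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        dist = math.hypot(x - cx, y - cy)
        if dist > radius:                 # (i) keep samples inside the envelope
            continue
        weight = 1.0 - dist / radius      # weight decreases away from the center
        weighted_sum += weight * value    # (ii) multiply, (iii) sum
        weight_total += weight
    if weight_total == 0.0:
        return None                       # no contributing samples
    return weighted_sum / weight_total    # (iv) normalize
```

Dividing by the accumulated weight is what keeps a pixel with few contributing samples from appearing darker or brighter than one with many; with pre-normalized weights that final division would be unnecessary.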

In some embodiments, the graphics system may be configured
to dynamically change the size or type of regions
being used (e.g., changing the width of the vertical columns
used on a frame-by-frame basis). Some embodiments of the
graphics system may support a variable resolution or variable
density frame buffer. In these configurations, the graphics
system is configured to render samples more densely in
certain areas of the image (e.g., the center of the image or the
portion of the image where the viewer's attention is most
likely focused). Advantageously, the ability to dynamically
vary the size and/or shape of the regions used may allow the
graphics system to equalize (or come closer to equalizing)
the number of samples that each sample-to-pixel calculation
unit processes for a particular frame.

The samples may include color components and alpha
(e.g., transparency) components, and may be stored in
"bins" to simplify the process of storing and retrieving
samples from the sample buffer. As described in greater
detail below, bins are a means for organizing and dividing
the sample buffer into smaller sets of storage locations. In
addition, in some embodiments the three-dimensional
graphics data may be received in a compressed form (e.g.,
using geometry compression). In these embodiments the
graphics processors may be configured to decompress the
three-dimensional graphics data before rendering the
samples. As used herein, the term "color components"
includes information on a per-sample or per-pixel basis that
is usable to determine the color of the pixel or sample. For
example, RGB information and transparency information
may be color components.

A method for rendering a set of three-dimensional graphics
data is also contemplated. In one embodiment the method
comprises: (i) receiving the three-dimensional graphics data,
(ii) generating one or more samples based on the graphics
data, (iii) storing the samples, (iv) selecting stored samples,
and (v) filtering the selected samples in parallel to form
output pixels. The stored samples may be selected according
to a plurality of regions, as described above.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing, as well as other objects, features, and
advantages of this invention may be more completely understood
by reference to the following detailed description
when read together with the accompanying drawings in
which:

FIG. 1A illustrates one embodiment of a computer system
that includes one embodiment of a graphics system;

FIG. 1B illustrates another embodiment of a computer
system that is part of a virtual reality work station;

FIG. 2 illustrates one embodiment of a network to which
the computer systems of FIGS. 1A-B may be connected;

FIG. 3A is a diagram illustrating another embodiment of
the graphics system of FIG. 1 as a virtual reality work
station;

FIG. 3B is a more detailed diagram illustrating one embodiment
of a graphics system with a sample buffer;

FIG. 4 illustrates traditional pixel calculation;

FIG. 5A illustrates one embodiment of super-sampling;

FIG. 5B illustrates a random distribution of samples;

FIG. 6 illustrates details of one embodiment of a graphics
system having one embodiment of a variable resolution
super-sampled sample buffer;

FIG. 7 illustrates details of another embodiment of a
graphics system having one embodiment of a variable
resolution super-sampled sample buffer and a double buffered
sample position memory;

FIG. 8 illustrates details of three different embodiments of
sample positioning schemes;

FIG. 9 illustrates details of one embodiment of a sample
positioning scheme;

FIG. 10 illustrates details of another embodiment of a
sample positioning scheme;

FIG. 11A illustrates details of one embodiment of a
graphics system configured to convert samples to pixels in
parallel using vertical screen stripes (columns);

FIG. 11B illustrates details of another embodiment of a
graphics system configured to convert samples to pixels in
parallel using vertical screen stripes (columns);

FIG. 12 illustrates details of another embodiment of a
graphics system configured to convert samples to pixels in
parallel using horizontal screen stripes (rows);

FIG. 13 illustrates details of another embodiment of a
graphics system configured to convert samples to pixels in
parallel using rectangular regions;

FIG. 14 illustrates details of one method for reading
samples from a sample buffer;

FIG. 15 illustrates details of one embodiment of a method
for dealing with boundary conditions;

FIG. 16 illustrates details of another embodiment of a
method for dealing with boundary conditions;

FIG. 17 is a flowchart illustrating one embodiment of a
method for drawing samples into a super-sampled sample
buffer;

FIG. 18 illustrates one embodiment of a method for
coding triangle vertices;

FIG. 19 illustrates one embodiment of a method for
calculating pixels from samples;

FIG. 20 illustrates details of one embodiment of a sample
to pixel calculation for an example set of samples;

FIG. 21 illustrates one embodiment of a method for
varying the density of samples;

FIG. 22 illustrates another embodiment of a method for
varying the density of samples;

FIG. 23 illustrates yet another embodiment of a method
for varying the density of samples;

FIGS. 24A-B illustrate details of one embodiment of a
method for utilizing eye-tracking to vary the density of
samples; and

FIGS. 25A-B illustrate details of one embodiment of a
method for utilizing eye-tracking to vary the density of
samples.

While the invention is susceptible to various modifications
and alternative forms, specific embodiments thereof
are shown by way of example in the drawings and will
herein be described in detail. It should be understood,
however, that the drawings and detailed description thereto
are not intended to limit the invention to the particular form
disclosed, but on the contrary, the intention is to cover all
modifications, equivalents, and alternatives falling within
the spirit and scope of the present invention as defined by the
appended claims.

DETAILED DESCRIPTION OF SEVERAL
EMBODIMENTS

Referring now to FIG. 1A, one embodiment of a computer
system 80 that includes a three-dimensional (3-D) graphics
system is shown. The 3-D graphics system may be comprised
in any of various systems, including a computer
system, network PC, Internet appliance, a television, including
HDTV systems and interactive television systems, personal
digital assistants (PDAs), and other devices which
display 2D and/or 3D graphics, among others.

As shown, the computer system 80 comprises a system 
unit 82 and a video monitor or display device 84 coupled to 
the system unit 82. The display device 84 may be any of 
various types of display monitors or devices (e.g., a CRT, 
LCD, or gas-plasma display). Various input devices may be 
connected to the computer system, including a keyboard 86 
and/or a mouse 88, or other input device (e.g., a trackball, 
digitizer, tablet, six-degree of freedom input device, head 
tracker, eye tracker, data glove, body sensors, etc.). Appli- 
cation software may be executed by the computer system 80 
to display 3-D graphical objects on display device 84. As 
described further below, the 3-D graphics system in computer
system 80 includes a super-sampled sample buffer
with a programmable real-time sample-to-pixel calculation
unit to improve the quality and realism of images displayed
on display device 84.

Computer system 80 may also include eye-tracking sensor
92 and/or 3-D glasses 90. 3-D glasses 90 may be active (e.g.,
LCD shutter-type) or passive (e.g., polarized, red-green, 
etc.) and may allow the user to view a more three- 
dimensional image on display device 84. With glasses 90, 
each eye receives a slightly different image, which the 
viewer's mind interprets as a "true" three-dimensional view. 
Sensor 92 may be configured to determine which part of the
image on display device 84 the viewer is looking at (i.e.,
where the viewer's field of view is centered). The information
provided by sensor 92 may be used in a number of
different ways as will be described below.

Virtual Reality Computer System — FIG. 1B

FIG. 1B illustrates another embodiment of a computer
system 70. In this embodiment, the system comprises a 
head-mounted display device 72, head-tracking sensors 74, 
and a data glove 76. Head mounted display 72 may be 
coupled to system unit 82 via a fiber optic link 94, or one or 
more of the following: an electrically-conductive link, an 
infra-red link, or a wireless (e.g., RF) link. Other embodi- 
ments are possible and contemplated. 

Computer Network — FIG. 2 

Referring now to FIG. 2, a computer network 500 is 
shown comprising at least one server computer 502 and one 
or more client computers 506A-N. (In the embodiment
shown in FIG. 2, client computers 506A-B are depicted).
One or more of the client systems may be configured 
similarly to computer system 80, with each having one or 
more graphics systems 112 as described above. Server 502 
and client(s) 506 may be joined through a variety of con- 
nections 504, such as a local-area network (LAN), a wide- 
area network (WAN), or an Internet connection. In one 
embodiment, server 502 may store and transmit 3-D geom- 
etry data (which may be compressed) to one or more of 
clients 506. The clients 506 receive the compressed 3-D 
geometry data, decompress it (if necessary) and then render 
the geometry data. The rendered image is then displayed on 
the client's display device. The clients render the geometry 
data and display the image using super-sampled sample 
buffer and real-time filter techniques described above. In 
another embodiment, the compressed 3-D geometry data 
may be transferred between client computers 506. 

Computer System Block Diagram — FIG. 3A 

FIG. 3A presents a simplified block diagram for computer 
system 80. Elements of computer system 80 that are not 
necessary for an understanding of the present invention are
suppressed for convenience. Computer system 80 comprises 
a host central processing unit (CPU) 102 and a 3-D graphics 
system 112 coupled to system bus 104. A system memory
106 may also be coupled to system bus 104.

Host CPU 102 may be realized by any of a variety of 
processor technologies. For example, host CPU 102 may 
comprise one or more general purpose microprocessors, 
parallel processors, vector processors, digital signal
processors, etc., or any combination thereof. System
memory 106 may include one or more memory subsystems
representing different types of memory technology. For 
example, system memory 106 may include read-only 
memory (ROM), random access memory (RAM) — such as
static random access memory (SRAM), synchronous
dynamic random access memory (SDRAM), and Rambus 
dynamic random access memory (RDRAM) — and mass 
storage devices. 

System bus 104 may comprise one or more communication
buses or host computer buses (for communication
between host processors and memory subsystems). In
addition, various peripheral devices and peripheral buses
may be connected to system bus 104.

Graphics system 112 is configured according to the principles
of the present invention, and may couple to system
bus 104 by a crossbar switch or any other type of bus
connectivity logic. Graphics system 112 drives each of
projection devices PD_1 through PD_L and display device 84 with a
corresponding video signal.

It is noted that the 3-D graphics system 112 may couple 
to one or more busses of various types in addition to system 
bus 104. Furthermore, the 3-D graphics system 112 may
couple to a communication port, and thereby directly
receive graphics data from an external source such as the
Internet or a local area network.

Host CPU 102 may transfer information to/from graphics 
system 112 according to a programmed input/output (I/O) 
protocol over system bus 104. Alternatively, graphics system
112 may access system memory 106 according to a direct
memory access (DMA) protocol or through intelligent
bus-mastering.

A graphics application program conforming to an application
programming interface (API) such as OpenGL® (a
registered trademark of Silicon Graphics, Inc.) or Java3D™
(a trademark of Sun Microsystems, Inc.) may execute on
host CPU 102 and generate commands and data that define
a geometric primitive such as a polygon for output on
projection devices PD_1 through PD_L and/or display device
84. Host CPU 102 may transfer this graphics data to system
memory 106. Thereafter, the host CPU 102 may transfer the
graphics data to graphics system 112 over system bus 104.
In another embodiment, graphics system 112 may read
geometry data arrays from system memory 106 using DMA
access cycles. In yet another embodiment, graphics system
112 may be coupled to system memory 106 through a direct
port, such as an Accelerated Graphics Port (AGP) promulgated
by Intel Corporation.

Graphics system 112 may receive graphics data from any
of various sources including host CPU 102, system memory
106 or any other memory, external sources such as a network
(e.g., the Internet) or a broadcast medium (e.g., television).

As will be described below, graphics system 112 may be
configured to allow more efficient microcode control, which
results in increased performance for handling of incoming
color values corresponding to the polygons generated by
host CPU 102.
While graphics system 112 is depicted as part of computer
system 80, graphics system 112 may also be configured as
a stand-alone device. Graphics system 112 may also be
configured as a single chip device or as part of a
system-on-a-chip or a multi-chip module.

Graphics system 112 may be comprised in any of various
systems, including a network PC, an Internet appliance, a
television (including an HDTV system or an interactive
television system), a personal digital assistant (PDA), or
other devices which display 2D and/or 3D graphics.

As described further below, the 3-D graphics system
within computer system 80 includes a super-sampled
sample buffer and a plurality of programmable sample-to-pixel
calculation units to improve the quality and realism of
images displayed by projection devices PD_1 through PD_L
and/or display device 84. Each sample-to-pixel calculation
unit may include a filter (i.e., convolution) pipeline or other
hardware for generating pixel values (e.g., red, green and
blue values) based on samples in the sample buffer. Each
sample-to-pixel calculation unit may obtain samples from
the sample buffer and generate pixel values which are
provided to any of projection devices PD_1 through PD_L or
display device 84. The sample-to-pixel calculation units
may operate in a "real-time" or "on-the-fly" fashion.

As used herein the terms "filter" and "convolve" are used
interchangeably. As used herein, the term "real-time" refers
to a process or operation that is performed at or near the
refresh rate of projection devices PD_1 through PD_L or display
device 84. The term "on-the-fly" refers to a process or
operation that generates images at a rate near or above the
minimum rate required for displayed motion to appear
smooth (i.e., motion fusion) and for the light intensity to
appear continuous (i.e., flicker fusion). These concepts are
further described in the book "Spatial Vision" by Russell L.
De Valois and Karen K. De Valois, Oxford University Press,
1988.

Graphics System — FIG. 3B

FIG. 3B presents a block diagram for one embodiment of
graphics system 112 according to the present invention.
Graphics system 112 may comprise a graphics processing
unit (GPU) 90, one or more super-sampled sample buffers
162, and one or more sample-to-pixel calculation units
170-1 through 170-V. Graphics system 112 may also comprise
one or more digital-to-analog converters (DACs) 178-1
through 178-L. Graphics processing unit 90 may comprise
any combination of processor technologies. For example,
graphics processing unit 90 may comprise specialized
graphics processors or calculation units, multimedia
processors, DSPs, or general purpose processors.

In one embodiment, graphics processing unit 90 may
comprise one or more rendering units 150A-D. Graphics
processing unit 90 may also comprise one or more control
units 140, one or more data memories 152A-D, and one or
more schedule units 154. Sample buffer 162 may comprise
one or more sample memories 160A-160N.

A. Control Unit 140

Control unit 140 operates as the interface between graphics
system 112 and computer system 80 by controlling the
transfer of data between graphics system 112 and computer
system 80. In embodiments of graphics system 112 that
comprise two or more rendering units 150A-D, control unit
140 may also divide the stream of data received from
computer system 80 into a corresponding number of parallel
streams that are routed to the individual rendering units
150A-D. The graphics data may be received from computer
system 80 in a compressed form. Graphics data compression
may advantageously reduce the required transfer bandwidth
between computer system 80 and graphics system 112. In
one embodiment, control unit 140 may be configured to split
and route the received data stream to rendering units
150A-D in compressed form.

The graphics data may comprise one or more graphics
primitives. As used herein, the term graphics primitive
includes polygons, parametric surfaces, splines, NURBS
(non-uniform rational B-splines), subdivision surfaces,
fractals, volume primitives, and particle systems. These
graphics primitives are described in detail in the text book
entitled "Computer Graphics: Principles and Practice" by
James D. Foley et al., published by Addison-Wesley
Publishing Co., Inc., 1996.

It is noted that the embodiments and examples of the
invention presented herein are described in terms of polygons
for the sake of simplicity. However, any type of
graphics primitive may be used instead of or in addition to
polygons in these embodiments and examples.

B. Rendering Units

Rendering units 150A-D (also referred to herein as draw
units) are configured to receive graphics instructions and
data from control unit 140 and then perform a number of
functions which depend on the exact implementation. For
example, rendering units 150A-D may be configured to
perform decompression (if the received graphics data is
presented in compressed form), transformation, clipping,
lighting, texturing, depth cueing, transparency processing,
set-up, visible object determination, and virtual screen
rendering of various graphics primitives occurring within the
graphics data.

Depending upon the type of compressed graphics data
received, rendering units 150A-D may be configured to
perform arithmetic decoding, run-length decoding, Huffman
decoding, and dictionary decoding (e.g., LZ77, LZSS,
LZ78, and LZW). In another embodiment, rendering units
150A-D may be configured to decode graphics data that has
been compressed using geometric compression. Geometric
compression of 3D graphics data may achieve significant
reductions in data size while retaining most of the image
quality. Two methods for compressing and decompressing
3D geometry are described in:

U.S. Pat. No. 5,793,371, application Ser. No. 08/511,294,
filed on Aug. 4, 1995, entitled "Method And Apparatus
For Geometric Compression Of Three-Dimensional
Graphics Data," Attorney Docket No. 5181-05900; and

U.S. patent application Ser. No. 09/095,777, filed on Jun.
11, 1998, entitled "Compression of Three-Dimensional
Geometry Data Representing a Regularly Tiled Surface
Portion of a Graphical Object," Attorney Docket No.
5181-06602.

In embodiments of graphics system 112 that support
decompression, the graphics data received by each rendering
unit 150 is decompressed into one or more graphics "primitives"
which may then be rendered. The term primitive
refers to components of an object that define its shape (e.g.,
points, lines, triangles, polygons in two or three dimensions,
polyhedra, voxels, or free-form surfaces in three
dimensions). Each rendering unit 150 may be any suitable
type of high performance processor (e.g., a specialized
graphics processor or calculation unit, a multimedia
processor, a digital signal processor, or a general purpose
processor).

Transformation refers to applying a geometric operation
to a primitive or an object comprising a set of primitives. For
example, an object represented by a set of vertices in a local
coordinate system may be embedded with arbitrary position,
orientation, and size in world space using an appropriate 
sequence of translation, rotation, and scaling transforma- 
tions. Transformation may also comprise reflection, 
skewing, or any other affine transformation. More generally, 
transformations may comprise nonlinear operations. 
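
As a rough illustration of embedding a locally defined object in world space, the sketch below applies a uniform scale, a rotation about the z axis, and a translation to a list of vertices. The function name `transform_vertices`, the choice of rotation axis, and the scale-rotate-translate order are assumptions made for the example, not details taken from the disclosure.

```python
import math

def transform_vertices(vertices, scale, angle_rad, translation):
    """Scale, rotate about the z axis, then translate (x, y, z) vertices.

    Hypothetical helper for illustration; a real pipeline would
    typically compose these steps into a single matrix.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    tx, ty, tz = translation
    world = []
    for x, y, z in vertices:
        x, y, z = x * scale, y * scale, z * scale   # scaling
        xr = c * x - s * y                          # rotation about z
        yr = s * x + c * y
        world.append((xr + tx, yr + ty, z + tz))    # translation
    return world
```

For instance, the local vertex (1, 0, 0) scaled by 2, rotated by 90 degrees, and translated by (1, 1, 1) lands at approximately (1, 3, 1) in world space.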

Lighting refers to calculating the illumination of objects.
Lighting computations result in an assignment of color
and/or brightness to objects or to selected points (e.g.,
vertices) on objects. Depending upon the shading algorithm
being used (e.g., constant, Gouraud, or Phong shading),
lighting may be evaluated at a number of different locations.
For example, if constant shading is used (i.e., the lighted
surface of a polygon is assigned a constant illumination
value), then the lighting need only be calculated once per
polygon. If Gouraud shading is used, then the lighting is
calculated once per vertex. Phong shading calculates the
lighting on a per-sample basis.

Clipping refers to the elimination of graphics primitives 
or portions of graphics primitives which lie outside of a 3-D 
view volume in world space. The 3-D view volume may 
represent that portion of world space which is visible to a 
virtual observer situated in world space. For example, the 
view volume may be a solid cone generated by a 2-D view 
window and a view point located in world space. The solid 
cone may be imagined as the union of all rays emanating
from the view point and passing through the view window.
The view point may represent the world space location of the
virtual observer. Primitives or portions of primitives which
lie outside the 3-D view volume are not currently visible and
may be eliminated from further processing. Primitives or
portions of primitives which lie inside the 3-D view volume 
are candidates for projection onto the 2-D view window. 

In order to simplify the clipping and projection 
computations, primitives may be transformed into a second, 
more convenient, coordinate system referred to herein as the 
viewport coordinate system. In viewport coordinates, the 
view volume maps to a canonical 3-D viewport which may 
be more convenient for clipping against. The term set-up 
refers to this mapping of graphics primitives into viewport 
coordinates. 

Graphics primitives or portions of primitives which sur- 
vive the clipping computation may be projected onto a 2-D 
viewport depending on the results of a visibility determina- 
tion. Instead of clipping in 3-D, graphics primitives may be 
projected onto a 2-D view plane (which includes the 2-D 
viewport) and then clipped with respect to the 2-D viewport. 

Virtual display rendering refers to calculations that are 
performed to generate samples for projected graphics primi- 
tives. For example, the vertices of a triangle in 3-D may be 
projected onto the 2-D viewport. The projected triangle may 
be populated with samples, and values (e.g. red, green, blue 
and z values) may be assigned to the samples based on the 
corresponding values already determined for the projected 
vertices. For example, the red value for each sample in the 
projected triangle may be interpolated from the known red 
values of the vertices. These sample values for the projected 
triangle may be stored in sample buffer 162. Depending
upon the embodiment, sample buffer 162 also stores a z value
for each sample. This z value is stored with the sample for
a number of reasons, including depth-buffering. As samples
for successive primitives are rendered, a virtual image 
accumulates in sample buffer 162. Thus, the 2-D viewport is 
said to be a virtual screen on which the virtual image is 
rendered. The sample values comprising the virtual image 
are stored into sample buffer 162. Points in the 2-D viewport 
are described in terms of virtual screen coordinates X and Y, 
and are said to reside in virtual screen space. 
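
The interpolation of a sample's red value from the red values at the projected vertices can be sketched with barycentric coordinates. This is one common interpolation scheme, assumed here for illustration; the text does not specify which scheme is used, and the names `barycentric` and `interpolate_red` are hypothetical.

```python
def barycentric(p, a, b, c):
    """Return barycentric weights of 2-D point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b

def interpolate_red(sample_pos, verts, reds):
    """Interpolate a red value at sample_pos, or None if outside the triangle."""
    weights = barycentric(sample_pos, *verts)
    if min(weights) < 0.0:          # sample lies outside the projected triangle
        return None
    return sum(w * r for w, r in zip(weights, reds))
```

The same weights could be reused for the green, blue, and z values of the sample, so the coordinates need only be computed once per sample position.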
When the virtual image is complete, e.g., when all graphics
primitives have been rendered, sample-to-pixel calculation
units 170 may access the samples comprising the virtual
image and may filter the samples to generate pixel values. In
other words, the sample-to-pixel calculation units 170 may
perform a spatial convolution of the virtual image with
respect to a convolution kernel f(X,Y) to generate pixel
values. For example, a red value R_p for a pixel P may be
computed at any location (X_p, Y_p) in virtual screen space
based on the relation

R_p = \frac{1}{E} \sum_k f(X_k - X_p, Y_k - Y_p) \, R(X_k, Y_k),

where R(X_k, Y_k) denotes the red value of the sample at
(X_k, Y_k) and the summation is evaluated at samples (X_k, Y_k) in the
neighborhood of location (X_p, Y_p). Since convolution kernel
f(X,Y) is non-zero only in a neighborhood of the origin, the
displaced kernel f(X - X_p, Y - Y_p) may take non-zero values
only in a neighborhood of location (X_p, Y_p). The value E is
a normalization value that may be computed according to
the relation

E = \sum_k f(X_k - X_p, Y_k - Y_p),

where the summation is evaluated in the same neighborhood
as above. The summation for the normalization value E may
be performed in parallel with the summation for the red pixel
value R_p. The location (X_p, Y_p) may be referred to as a pixel
center or pixel origin. In the case where the convolution
kernel f(X,Y) is symmetric with respect to the origin (0,0),
the term pixel center may be used.
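
A short worked example may make the normalization concrete. The numbers below are invented for illustration: suppose three samples fall within the kernel's neighborhood of the pixel center, with kernel values 1.0, 0.5, and 0.5 and red values 0.8, 0.4, and 0.2 respectively.

```latex
\begin{aligned}
E   &= 1.0 + 0.5 + 0.5 = 2.0, \\
\sum_k f(X_k - X_p,\, Y_k - Y_p)\, R(X_k, Y_k)
    &= (1.0)(0.8) + (0.5)(0.4) + (0.5)(0.2) = 1.1, \\
R_p &= \frac{1.1}{2.0} = 0.55 .
\end{aligned}
```

Because both sums run over the same samples, they can be accumulated side by side in a single pass over the neighborhood.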

The pixel values may be presented to projection devices
PD_1 through PD_L for display on projection screen SCR. The
projection devices each generate a portion of integrated
image IMG. Sample-to-pixel calculation units 170 may also
generate pixel values for display on display device 84.

In the embodiment of graphics system 112 shown in FIG.
3, rendering units 150A-D calculate sample values instead
of pixel values. This allows rendering units 150A-D to
perform super-sampling, i.e., to calculate more than one
sample per pixel. Super-sampling in the context of the
present invention is discussed more thoroughly below. More
details on super-sampling are discussed in the following
books: "Principles of Digital Image Synthesis" by Andrew
Glassner, 1995, Morgan Kaufmann Publishing (Volume 1);
and "The RenderMan Companion" by Steve Upstill, 1990,
Addison Wesley Publishing.

Sample buffer 162 may be double-buffered so that rendering
units 150A-D may write samples for a first virtual
image into a first portion of sample buffer 162, while a
second virtual image is simultaneously read from a second
portion of sample buffer 162 by sample-to-pixel calculation
units 170.
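
The double-buffering arrangement can be modeled in software as two portions whose roles are exchanged once a virtual image is complete. The class below is a simplified, single-threaded sketch; the class and method names are hypothetical, and real hardware would perform the reads and writes concurrently.

```python
class DoubleBufferedSampleBuffer:
    """Toy model: one portion is written while the other is read."""

    def __init__(self):
        self._portions = [[], []]
        self._write_index = 0          # portion currently being rendered into

    def write_sample(self, sample):
        """Rendering side: append a sample to the portion being written."""
        self._portions[self._write_index].append(sample)

    def read_samples(self):
        """Filtering side: return the samples of the completed image."""
        return list(self._portions[1 - self._write_index])

    def swap(self):
        """Exchange roles once the image being written is complete."""
        self._write_index = 1 - self._write_index
        self._portions[self._write_index] = []   # start the next image empty
```

Swapping roles rather than copying data means the filtering side never observes a partially rendered image.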

It is noted that the 2-D viewport and the virtual image
which is rendered with samples into sample buffer 162 may
correspond to an area larger than that area which is physically
displayed as integrated image IMG or display image
DIM. For example, the 2-D viewport may include a viewable
subwindow. The viewable subwindow may correspond
to integrated image IMG and/or display image DIM, while
the marginal area of the 2-D viewport (outside the viewable
subwindow) may allow for various effects such as panning
and zooming. In other words, only that portion of the virtual
image which lies within the viewable subwindow gets
physically displayed. In one embodiment, the viewable
subwindow equals the whole of the 2-D viewport. In this
case, all of the virtual image gets physically displayed.
Note that rendering units 150A-D may comprise a number
of smaller and more specialized functional units, e.g.,
one or more set-up/decompress units and one or more 
lighting units. 

C. Data Memories 

Each of rendering units 150A-D may be coupled to a 
corresponding one of instruction and data memories 
152A-D. In one embodiment, each of memories 152A-D 
may be configured to store both data and instructions for a 
corresponding one of rendering units 150A-D. While imple- 
mentations may vary, in one embodiment, each data memory 
152A-D may comprise two 8 MByte SDRAMs, providing 
a total of 16 MBytes of storage for each rendering unit 
150A-D. In another embodiment, RDRAMs (Rambus 
DRAMs) may be used to support the decompression and 
set-up operations of each rendering unit, while SDRAMs 
may be used to support the draw functions of each rendering 
unit. Data memories 152A-D may also be referred to as
texture and render memories 152A-D.

D. Schedule Unit 

Schedule unit 154 may be coupled between rendering 
units 150A-D and sample memories 160A-N. Schedule unit 
154 is configured to sequence the completed samples and 
store them in sample memories 160A-N. Note in larger 
configurations, multiple schedule units 154 may be used in 
parallel. In one embodiment, schedule unit 154 may be 
implemented as a crossbar switch. 

E. Sample Memories 

Super-sampled sample buffer 162 comprises sample 
memories 160A-160N, which are configured to store the 
plurality of samples generated by rendering units 150A-D.
As used herein, the term "sample buffer" refers to one or 
more memories which store samples. As previously noted, 
samples may be filtered to form each output pixel value. 
Output pixel values may be provided to projection devices
PD_1 through PD_L for display on projection screen SCR.
Output pixel values may also be provided to display device 
84. Sample buffer 162 may be configured to support super- 
sampling, critical sampling, or sub-sampling with respect to 
pixel resolution. In other words, the average distance
between samples (X_k, Y_k) in the virtual image (stored in
sample buffer 162) may be smaller than, equal to, or larger 
than the average distance between pixel centers in virtual 
screen space. Furthermore, because the convolution kernel 
f(X,Y) may take non-zero functional values over a neigh- 
borhood which spans several pixel centers, a single sample 
may contribute to several output pixel values. 

Sample memories 160A-160N may comprise any of 
various types of memories (e.g., SDRAMs, SRAMs, 
RDRAMs, 3DRAMs, or next-generation 3DRAMs) in vary- 
ing sizes. In one embodiment, each schedule unit 154 is 
coupled to four banks of sample memories, wherein each 
bank comprises four 3DRAM-64 memories. Together, the 
3DRAM-64 memories may form a 116-bit deep super- 
sampled sample buffer that stores multiple samples per 
pixel. For example, in one embodiment, each sample 
memory 160A-160N may store up to sixteen samples per 
pixel. 

3DRAM-64 memories are specialized memories config- 
ured to support full internal double buffering with single 
buffered Z in one chip. The double buffered portion com- 
prises two RGBX buffers, wherein X is a fourth channel that 
can be used to store other information (e.g., alpha). 
3DRAM-64 memories also have a lookup table that takes in 
window ID information and controls an internal 2-1 or 3-1 
multiplexer that selects which buffer's contents will be 
output. 3DRAM-64 memories are next-generation 3DRAM 
memories that may soon be available from Mitsubishi
Electric Corporation's Semiconductor Group. In one
embodiment, four chips used in combination are sufficient to
create a double-buffered 1280x1024 super-sampled sample
buffer.

Since the 3DRAM-64 memories are internally double-
buffered, the input pins for each of the two frame buffers in
the double-buffered system are time multiplexed (using
multiplexers within the memories). The output pins may
similarly be time multiplexed. This allows reduced pin count
while still providing the benefits of double buffering.
3DRAM-64 memories further reduce pin count by not
having z output pins. Since z comparison and memory buffer
selection are dealt with internally, use of the 3DRAM-64
memories may simplify the configuration of sample buffer
162. For example, sample buffer 162 may require little or no
selection logic on the output side of the 3DRAM-64 memo-
ries. The 3DRAM-64 memories also reduce memory band-
width since information may be written into a 3DRAM-64
memory without the traditional process of reading data out,
performing a z comparison, and then writing data back in.
Instead, the data may be simply written into the 3DRAM-64
memory, with the memory performing the steps described
above internally.

However, in other embodiments of graphics system 112,
other memories (e.g., SDRAMs, SRAMs, RDRAMs, or 
current generation 3DRAMs) may be used to form sample 
buffer 162. 

Graphics processing unit 90 may be configured to gener-
ate a plurality of sample positions according to a particular
sample positioning scheme (e.g., a regular grid, a perturbed
regular grid, etc.). Alternatively, the sample positions (or
offsets that are added to regular grid positions to form the
sample positions) may be read from a sample position
memory (e.g., a RAM/ROM table). Upon receiving a poly-
gon that is to be rendered, graphics processing unit 90
determines which samples fall within the polygon based
upon the sample positions. Graphics processing unit 90
renders the samples that fall within the polygon and stores
rendered samples in sample memories 160A-N. Note as
used herein the terms render and draw are used interchange- 
ably and refer to calculating color values for samples. Depth 
values, alpha values, and other per-sample values may also 
be calculated in the rendering or drawing process. 
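
By way of illustration only, the step of determining which samples fall within a polygon may be sketched for the triangle case as follows. The sketch assumes an edge-function test, which is one common technique and is not stated to be the method used by graphics processing unit 90; the names `edge` and `samples_in_triangle` are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed term that is positive when P lies to the left of edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def samples_in_triangle(samples, tri):
    """Return the sample positions that fall inside (or on) the triangle.

    samples: list of (x, y) positions in virtual screen space.
    tri: three (x, y) vertices, in either winding order.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    inside = []
    for (px, py) in samples:
        e0 = edge(x0, y0, x1, y1, px, py)
        e1 = edge(x1, y1, x2, y2, px, py)
        e2 = edge(x2, y2, x0, y0, px, py)
        # A sample is inside when all three edge terms share a sign;
        # a zero term means the sample lies on an edge and is kept.
        all_non_negative = e0 >= 0 and e1 >= 0 and e2 >= 0
        all_non_positive = e0 <= 0 and e1 <= 0 and e2 <= 0
        if all_non_negative or all_non_positive:
            inside.append((px, py))
    return inside
```

Per-sample values such as color or z would then be computed only for the positions returned by such a test.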

F. Sample-to-pixel Calculation Units

Sample-to-pixel calculation units 170-1 through 170-V
(collectively referred to as sample-to-pixel calculation units
170) may be coupled between sample memories 160A-N
and DACs 178-1 through 178-L. Sample-to-pixel calcula-
tion units 170 are configured to read selected samples from
sample memories 160A-N and then perform a convolution
(i.e. a filtering operation) on the samples to generate the
output pixel values which are provided to DACs 178-1
through 178-L. The sample-to-pixel calculation units 170
may be programmable to allow them to perform different
filter functions at different times, depending upon the type of
output desired. In one embodiment, the sample-to-pixel
calculation units 170 may implement a 5x5 super-sample
reconstruction band-pass filter to convert the super-sampled
sample buffer data (stored in sample memories 160A-N) to
pixel values. In other embodiments, calculation units 170
may filter a selected number of samples to calculate an
output pixel. The selected samples may be multiplied by a
spatial weighting function that gives weights to samples
based on their position with respect to the center of the pixel
being calculated. The filtering operation may use any of a
variety of filters, either alone or in combination. For
example, the convolution operation may employ a tent filter,
a circular filter, an elliptic filter, a Mitchell filter, a band pass
filter, a sinc function filter, etc.

Sample-to-pixel calculation units 170 may also be configured
with one or more of the following features: color look-up
using pseudo color tables, direct color, inverse gamma
correction, filtering of samples to pixels, programmable
gamma encoding, and optionally color space conversion.
Other features of sample-to-pixel calculation units 170 may
include programmable video timing generators, programmable
pixel clock synthesizers, edge-blending functions, hotspot
correction functions, color space and crossbar functions. Once
the sample-to-pixel calculation units have manipulated the
timing and color of each pixel, the pixels are output to DACs
178-1 through 178-L.

G. DACs

Digital-to-Analog Converters (DACs) 178-1 through 178-L
(collectively referred to as DACs 178) operate as the final
output stage of graphics system 112. DACs 178 translate
digital pixel data received from calculation units 170 into
analog video signals. Each of DACs 178-1 through 178-L may
be coupled to a corresponding one of projection devices PD_1
through PD_L. DAC 178-1 receives a first stream of digital
pixel data from one or more of calculation units 170 and
converts the first stream into a first video signal. The first
video signal is provided to projection device PD_1. Similarly,
each of DACs 178-1 through 178-L receives a corresponding
stream of digital pixel data and converts the digital pixel
data stream into a corresponding analog video signal which is
provided to a corresponding one of projection devices PD_1
through PD_L.

Note in one embodiment DACs 178 may be bypassed or omitted
completely in order to output digital pixel data in lieu of
analog video signals. This may be useful when projection
devices PD_1 through PD_L are based on a digital technology
(e.g., an LCD-type display or a digital micro-mirror
display).

Super-Sampling — FIGS. 4-5

FIG. 4 illustrates a portion of virtual screen space in a
non-super-sampled example. The dots denote sample
locations, and the rectangular boxes superimposed on virtual
screen space define pixel boundaries. One sample is located
in the center of each pixel, and values of red, green, blue, z,
etc. are computed for the sample. For example, sample 74 is
assigned to the center of pixel 70. Although rendering units
150 may compute values for only one sample per pixel,
sample-to-pixel calculation units 170 may still compute
output pixel values based on multiple samples, e.g. by using
a convolution filter whose support spans several pixels.

Turning now to FIG. 5A, an example of one embodiment
of super-sampling is illustrated. In this embodiment, two
samples are computed per pixel. The samples are distributed
according to a regular grid. Even though there are more
samples than pixels in the figure, output pixel values could
be computed using one sample per pixel, e.g. by throwing
out all but the sample nearest to the center of each pixel.
However, a number of advantages arise from computing
pixel values based on multiple samples.

A support region 72 is superimposed over pixel 70, and
illustrates the support of a filter which is localized at pixel
70. The support of a filter is the set of locations over which
the filter (i.e. the filter kernel) takes non-zero values. In this
example, the support region 72 is a circular disc. The output
pixel values (e.g. red, green, blue and z values) for pixel 70
are determined only by samples 74A and 74B, because these
are the only samples which fall within support region 72.
This filtering operation may advantageously improve the
realism of a displayed image by smoothing abrupt edges in
the displayed image (i.e., by performing anti-aliasing). The
filtering operation may simply average the values of samples
74A-B to form the corresponding output values of pixel 70,
or it may increase the contribution of sample 74B (at the
center of pixel 70) and diminish the contribution of sample
74A (i.e., the sample farther away from the center of pixel
70). The filter, and thus support region 72, is repositioned for
each output pixel being calculated so the center of support
region 72 coincides with the center position of the pixel
being calculated. Other filters and filter positioning schemes
are also possible and contemplated.

In the example of FIG. 5A, there are two samples per
pixel. In general, however, there is no requirement that the
number of samples be related to the number of pixels. The
number of samples may be completely independent of the
number of pixels. For example, the number of samples may
be smaller than the number of pixels. (This is the condition
that defines sub-sampling).

Turning now to FIG. 5B, another embodiment of super-
sampling is illustrated. In this embodiment, the samples are
positioned randomly. Thus, the number of samples used to
calculate output pixel values may vary from pixel to pixel.
Render units 150A-D calculate color information at each
sample position.

Super-Sampled Sample Buffer with Real-Time
Sample-To-Pixel Calculation — FIGS. 6-10

FIG. 6 illustrates one possible configuration for the flow
of data through one embodiment of graphics system 112. As
the figure shows, geometry data 350 is received by graphics
system 112 and used to perform draw process 352. The draw
process 352 is implemented by one or more of control unit
140, rendering units 150, data memories 152, and schedule
unit 154. Geometry data 350 comprises data for one or more
polygons. Each polygon comprises a plurality of vertices
(e.g., three vertices in the case of a triangle), some of which
may be shared among multiple polygons. Data such as x, y,
and z coordinates, color data, lighting data and texture map
information may be included for each vertex.

In addition to the vertex data, draw process 352 (which
may be performed by rendering units 150A-D) also receives
sample position information from a sample position memory
354. The sample position information defines the location of
samples in virtual screen space, i.e. in the 2-D viewport.
Draw process 352 selects the samples that fall within the
polygon currently being rendered and calculates a set of
values (e.g. red, green, blue, z, alpha, and/or depth of field
information) for each of these samples based on their
respective positions within the polygon. For example, the z
value of a sample that falls within a triangle may be
interpolated from the known z values of the three vertices.
Each set of computed sample values is stored into sample
buffer 162.

In one embodiment, sample position memory 354 is
embodied within rendering units 150A-D. In another
embodiment, sample position memory 354 may be realized
as part of memories 152A-152D, or as a separate memory.
Sample position memory 354 may store sample positions
in terms of their virtual screen coordinates (X,Y).
Alternatively, sample position memory 354 may be configured
to store only offsets dX and dY for the samples with
respect to positions on a regular grid. Storing only the offsets
may use less storage space than storing the entire coordinates
(X,Y) for each sample. The sample position information
stored in sample position memory 354 may be read by
a dedicated sample position calculation unit (not shown) and
processed to calculate sample positions for graphics
processing unit 90. More detailed information on the
computation of sample positions is included below (see
description of FIGS. 9 and 10).

In another embodiment, sample position memory 354
may be configured to store a table of random numbers.
Sample position memory 354 may also comprise dedicated
hardware to generate one or more different types of regular
grids. This hardware may be programmable. The stored
random numbers may be added as offsets to the regular grid
positions generated by the hardware. In one embodiment,
sample position memory 354 may be programmable to
access or "unfold" the random number table in a number of
different ways, and thus, may deliver more apparent
randomness for a given length of the random number table.
Thus, a smaller table may be used without generating the
visual artifacts caused by simple repetition of sample
position offsets.

Sample-to-pixel calculation process 360 uses the same
sample positions as draw process 352. Thus, in one
embodiment, sample position memory 354 may generate a
sequence of random offsets to compute sample positions for
draw process 352, and may subsequently regenerate the
same sequence of random offsets to compute the same
sample positions for sample-to-pixel calculation process
360. In other words, the unfolding of the random number
table may be repeatable. Thus, it may not be necessary to
store sample positions at the time of their generation for
draw process 352.

As shown in FIG. 6, sample position memory 354 may be
configured to store sample offsets generated according to a
number of different schemes such as a regular square grid,
a regular hexagonal grid, a perturbed regular grid, or a
random (stochastic) distribution. Graphics system 112 may
receive an indication from the operating system, device
driver, or the geometry data 350 that indicates which type of
sample positioning scheme is to be used. Thus the sample
position memory 354 is configurable or programmable to
generate position information according to one or more
different schemes. More detailed information on several
sample positioning schemes is provided further below (see
description of FIG. 8).

In one embodiment, sample position memory 354 may
comprise a RAM/ROM that contains stochastically
determined sample points or sample offsets. Thus, the density
of samples in virtual screen space may not be uniform when
observed at small scale. Two bins with equal area centered
at different locations in virtual screen space may contain
different numbers of samples. As used herein, the term "bin"
refers to a region or area in virtual screen space.

An array of bins may be superimposed over virtual screen
space, i.e. the 2-D viewport, and the storage of samples in
sample buffer 162 may be organized in terms of bins. The
sample buffer 162 may comprise an array of memory blocks
which correspond to the bins. Each memory block may store
the sample values (e.g. red, green, blue, z, alpha, etc.) for the
samples that fall within the corresponding bin. The
approximate location of a sample is given by the bin in which
it resides. The memory blocks may have addresses which are
easily computable from the corresponding bin locations in
virtual screen space, and vice versa. Thus, the use of bins
may simplify the storage and access of sample values in
sample buffer 162.

The bins may tile the 2-D viewport in a regular array, e.g.
in a square array, rectangular array, triangular array,
hexagonal array, etc., or in an irregular array. Bins may occur
in a variety of sizes and shapes. The sizes and shapes may be
programmable. The maximum number of samples that may
populate a bin is determined by the storage space allocated
to the corresponding memory block. This maximum number
of samples is referred to herein as the bin sample capacity,
or simply, the bin capacity. The bin capacity may take any
of a variety of values. The bin capacity value may be
programmable. Henceforth, the memory blocks in sample
buffer 162 which correspond to the bins in virtual screen
space will be referred to as memory bins.

The specific position of each sample within a bin may be
determined by looking up the sample's offset in the RAM/ROM
table, i.e. the sample's offset with respect to the bin
position (e.g. the lower-left corner or center of the bin, etc.).
However, depending upon the implementation, not all
choices for the bin capacity may have a unique set of offsets
stored in the RAM/ROM table. Offsets for a first bin
capacity value may be determined by accessing a subset of
the offsets stored for a second larger bin capacity value. In
one embodiment, each bin capacity value supports at least
four different sample positioning schemes. The use of
different sample positioning schemes may reduce final image
artifacts due to repeating sample positions.

In one embodiment, sample position memory 354 may
store pairs of 8-bit numbers, each pair comprising an x-offset
and a y-offset. (Other offsets are also possible, e.g., a time
offset, a z-offset, etc.) When added to a bin position, each
pair defines a particular position in virtual screen space, i.e.
the 2-D viewport. To improve read access times, sample
position memory 354 may be constructed in a wide/parallel
manner so as to allow the memory to output more than one
sample location per read cycle.

Once the sample positions have been read from sample
position memory 354, draw process 352 selects the samples
that fall within the polygon currently being rendered. Draw
process 352 then calculates the z and color information
(which may include alpha or other depth of field information
values) for each of these samples and stores the data into
sample buffer 162. In one embodiment, sample buffer 162
may only single-buffer z values (and perhaps alpha values)
while double-buffering other sample components such as
color. Unlike prior art systems, graphics system 112 may use
double-buffering for all samples (although not all
components of samples may be double-buffered, i.e., the
samples may have some components that are not
double-buffered). In one embodiment, the samples are stored
into sample buffer 162 in bins. In some embodiments, the bin
capacity may vary from frame to frame. In addition, the bin
capacity may vary spatially for bins within a single frame
rendered into sample buffer 162. For example, bins on the
edge of the 2-D viewport may have a smaller bin capacity
than bins corresponding to the center of the 2-D viewport.
Since viewers are likely to focus their attention mostly on the
center of the screen SCR or display image DIM, more
processing bandwidth may be dedicated to providing enhanced
image quality in the center of the 2-D viewport. Note that the
size and shape of bins may also vary from region to region,
or from frame to frame. The use of bins will be described in
greater detail below.

In parallel and independently of draw process 352, filter
process 360 is configured to: (a) read sample positions from
sample position memory 354, (b) read corresponding sample
values from sample buffer 162, (c) filter the sample values,
and (d) output the resulting output pixel values to one or
more of projection devices PD_1 through PD_L and/or display
device 84. Sample-to-pixel calculation units 170 implement
filter process 360. Filter process 360 is operable to generate
the red, green, and blue values for an output pixel based on a
spatial filtering of the corresponding data for a selected
plurality of samples, e.g. samples falling in a neighborhood 
of the pixel center. Other values such as alpha may also be 
generated. In one embodiment, filter process 360 is config- 
ured to: (i) determine the distance of each sample from the 
pixel center; (ii) multiply each sample's attribute values 
(e.g., red, green, blue, alpha) by a filter weight that is a 
specific (programmable) function of the sample's distance; 

(iii) generate sums of the weighted attribute values, one sum 
per attribute (e.g. a sum for red, a sum for green, etc.), and 

(iv) normalize the sums to generate the corresponding pixel 
attribute values. Filter process 360 is described in greater 
detail below (see description accompanying FIGS. 11, 12, 
and 14). 
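
Steps (i) through (iv) above may be sketched as follows, purely for illustration. The sketch assumes a tent (linearly decreasing) weight as the programmable function of distance and assumes samples represented as (x, y, (r, g, b)) tuples; the names `tent_weight` and `filter_pixel` are hypothetical and are not taken from the described embodiments.

```python
import math


def tent_weight(distance, radius):
    """Example weight: 1 at the pixel center, falling linearly to 0 at radius."""
    return max(0.0, 1.0 - distance / radius)


def filter_pixel(samples, center, radius, weight_fn=tent_weight):
    """Compute one output pixel's (r, g, b) from nearby samples.

    samples: iterable of (x, y, (r, g, b)) in virtual screen space.
    center: (x, y) of the pixel center.
    """
    cx, cy = center
    sums = [0.0, 0.0, 0.0]
    weight_total = 0.0
    for (x, y, rgb) in samples:
        d = math.hypot(x - cx, y - cy)       # (i) distance from pixel center
        w = weight_fn(d, radius)             # (ii) weight as a function of distance
        if w == 0.0:
            continue                         # sample lies outside the filter support
        for i in range(3):
            sums[i] += w * rgb[i]            # (iii) one weighted sum per attribute
        weight_total += w
    if weight_total == 0.0:
        return (0.0, 0.0, 0.0)               # assumption: no samples gives black
    return tuple(s / weight_total for s in sums)  # (iv) normalize the sums
```

Dividing by the total weight in step (iv) keeps the output brightness independent of how many samples happen to fall within the filter support, which matters when sample density varies from pixel to pixel.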

In the embodiment just described, the filter kernel is a 
function of distance from the pixel center, and thus, is 
radially symmetric. However, in alternative embodiments, 
the filter kernel may be a more general function of X and Y 
displacements from the pixel center. Thus, the support of the 
filter, i.e. the 2-D neighborhood over which the filter kernel 
takes non-zero values, may not be a circular disk. Any 
sample falling within the support of the filter kernel may 
affect the output pixel being computed. 

Turning now to FIG. 7, a diagram illustrating an alternate 
embodiment of graphics system 112 is shown. In this 
embodiment, two or more sample position memories 354A 
and 354B are utilized. Thus, the sample position memories 
354A-B are essentially double-buffered. If the sample posi- 
tions remain the same from frame to frame, then the sample 
positions may be single-buffered. However, if the sample 
positions vary from frame to frame, then graphics system 
112 may be advantageously configured to double-buffer the 
sample positions. The sample positions may be double- 
buffered on the rendering side (i.e., memory 354A may be 
double-buffered) and/or the filter side (i.e., memory 354B 
may be double-buffered). Other combinations are also pos-
sible. For example, memory 354A may be single-buffered,
while memory 354B is double-buffered. This configuration
may allow one side of memory 354B to be updated by draw 
process 352 while the other side of memory 354B is 
accessed by filter process 360. In this configuration, graphics 
system 112 may change sample positioning schemes on a 
per-frame basis by shifting the sample positions (or offsets) 
from memory 354A to double-buffered memory 354B as 
each frame is rendered. Thus, the sample positions which are 
stored in memory 354A and used by draw process 352 to 
render sample values may be copied to memory 354B for 
use by filter process 360. Once the sample position infor- 
mation has been copied to memory 354B, position memory 
354A may then be loaded with new sample positions (or 
offsets) to be used for a second frame to be rendered. In this 
way the sample position information follows the sample 
values from draw process 352 to filter process 360.
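
The hand-off of sample positions from the rendering side to the filter side may be modeled in software as follows. This is a simplified sketch under stated assumptions: memory 354A is single-buffered, memory 354B is double-buffered, and the copy occurs once per frame. The class and method names are hypothetical.

```python
class SamplePositionHandoff:
    """Toy model of memories 354A (draw side) and 354B (filter side)."""

    def __init__(self, first_frame_positions):
        self.mem_a = list(first_frame_positions)  # read by the draw process
        self.mem_b = [None, None]                 # two sides of memory 354B
        self.filter_side = 0                      # side read by the filter process

    def end_of_frame(self, next_frame_positions):
        """Copy the positions just used for drawing, then load new ones."""
        back_side = 1 - self.filter_side
        self.mem_b[back_side] = list(self.mem_a)  # update the side not being read
        self.filter_side = back_side              # filter now reads the fresh copy
        self.mem_a = list(next_frame_positions)   # draw side gets new positions

    def positions_for_draw(self):
        return self.mem_a

    def positions_for_filter(self):
        return self.mem_b[self.filter_side]
```

In this model the filter process always reads the positions that were used to render the frame it is filtering, even while the draw process has moved on to a different set of positions.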

Yet another alternative embodiment may store tags to 
offsets with the sample values in super-sampled sample 
buffer 162. These tags may be used to look up the offsets (i.e.
perturbations) dX and dY associated with each particular
sample. 

Sample Positioning Schemes 

FIG. 8 illustrates a number of different sample positioning 
schemes. In the regular positioning scheme 190, samples are 
positioned at fixed positions with respect to a regular grid 
which is superimposed on the 2-D viewport. For example, 
samples may be positioned at the center of the rectangles 
which are generated by the regular grid. More generally, any
tiling of the 2-D viewport may generate a regular positioning
scheme. For example, the 2-D viewport may be tiled with
triangles, and thus, samples may be positioned at the centers
(or vertices) of the triangular tiles. Hexagonal tilings, loga-
rithmic tilings, and semi-regular tilings such as Penrose
tilings are also contemplated.

In the perturbed regular positioning scheme 192, sample
positions are defined in terms of perturbations from a set of
fixed positions on a regular grid or tiling. In one
embodiment, the samples may be displaced from their
corresponding fixed grid positions by random x and y
offsets, or by random angles (ranging from 0 to 360 degrees)
and random radii (ranging from zero to a maximum radius).
The offsets may be generated in a number of ways, e.g. by
hardware based upon a small number of seeds, by reading a
table of stored offsets, or by using a pseudo-random func-
tion. Once again, perturbed regular grid scheme 192 may be
based on any type of regular grid or tiling. Samples gener-
ated by perturbation with respect to a grid (e.g., a hexagonal
tiling) may be particularly desirable due to the geometric
properties of this configuration.

Stochastic sample positioning scheme 194 represents a
third potential type of scheme for positioning samples.
Stochastic sample positioning involves randomly distribut-
ing the samples across the 2-D viewport. Random position-
ing of samples may be accomplished through a number of
different methods, e.g., using a random number generator
such as an internal clock to generate pseudo-random num-
bers. Random numbers or positions may also be pre-
calculated and stored in memory. Note, as used in this
application, random positions may be selected from a sta-
tistical population (e.g., a Poisson-disk distribution). Differ-
ent types of random and pseudo-random positions are
described in greater detail in Chapter 10 of Volume 1 of the
treatise titled "Principles of Digital Image Synthesis" by
Andrew S. Glassner, Morgan Kaufmann Publishers, 1995.
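
As one illustration of selecting random positions from a Poisson-disk distribution, the following sketch uses simple dart throwing: candidates are drawn uniformly and rejected when they fall too close to an already accepted sample. Dart throwing is chosen here only for brevity and is an assumption, not a method recited above; the function name and parameters are hypothetical.

```python
import math
import random


def poisson_disk_samples(width, height, min_dist, attempts=2000, seed=1):
    """Return sample positions no closer to one another than min_dist."""
    rng = random.Random(seed)  # seeded so the positions can be pre-calculated
    accepted = []
    for _ in range(attempts):
        cx = rng.uniform(0.0, width)
        cy = rng.uniform(0.0, height)
        # Keep the candidate only if it respects the minimum spacing.
        if all(math.hypot(cx - px, cy - py) >= min_dist for (px, py) in accepted):
            accepted.append((cx, cy))
    return accepted
```

Positions produced this way could be pre-calculated and stored in memory, consistent with the text above. The number of accepted samples depends on the attempt budget and is not fixed in advance.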

Turning now to FIG. 9, details of one embodiment of
perturbed regular positioning scheme 192 are shown. In this
embodiment, samples are randomly offset from a regular
square grid by x- and y-offsets. As the enlarged area shows,
sample 198 has an x-offset 134 that specifies its horizontal
displacement from its corresponding grid intersection point
196. Similarly, sample 198 also has a y-offset 136 that
specifies its vertical displacement from grid intersection
point 196. The random x-offset 134 and y-offset 136 may be
limited to a particular range of values. For example, the
x-offset may be limited to the range from zero to X_max, where
X_max is the width of a grid rectangle. Similarly, the
y-offset may be limited to the range from zero to Y_max, where
Y_max is the height of a grid rectangle. The random offset may
also be specified by an angle and radius with respect to the
grid intersection point 196.
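
The x- and y-offset form of the perturbed regular grid may be sketched as follows. The sketch assumes a seeded software pseudo-random generator in place of hardware or a stored table, and the name `perturbed_grid` and its parameters are hypothetical.

```python
import random


def perturbed_grid(cols, rows, x_max, y_max, seed=0):
    """Offset each grid intersection by a random (dx, dy).

    dx is drawn from the range zero to x_max (the grid rectangle width)
    and dy from zero to y_max (the grid rectangle height), so every
    sample stays within the rectangle anchored at its intersection point.
    """
    rng = random.Random(seed)
    positions = []
    for j in range(rows):
        for i in range(cols):
            grid_x, grid_y = i * x_max, j * y_max  # grid intersection point
            dx = rng.uniform(0.0, x_max)           # x-offset
            dy = rng.uniform(0.0, y_max)           # y-offset
            positions.append((grid_x + dx, grid_y + dy))
    return positions
```

Because the generator is seeded, calling the function again with the same seed reproduces the same positions, so a rendering stage and a filtering stage could each regenerate the positions rather than store them.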

FIG. 10 illustrates details of another embodiment of the
perturbed regular grid scheme 192. In this embodiment, the
samples are grouped into rectangular bins 138A-D. In this
embodiment, each bin comprises nine samples, i.e. has a bin
capacity of nine. Different bin capacities may be used in
other embodiments (e.g., bins storing four samples, 16
samples, etc.). Each sample's position may be determined
by an x- and y-offset relative to the origin of the bin in which
it resides. The origin of a bin may be chosen to be the
lower-left corner of the bin (or any other convenient location
within the bin). For example, the position of sample 198 is
determined by adding x-offset 124 and y-offset 126
respectively to the x and y coordinates of the origin 132D of
bin 138D. As previously noted, this may reduce the size of
sample position memory 354 used in some embodiments.
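
The bin-relative addressing of FIG. 10 may be illustrated with the following sketch. It assumes that each stored offset is an 8-bit value interpreted as a fraction (in units of 1/256) of the bin width or height, and that the bin origin is the lower-left corner; both the encoding and the names `sample_position` and `bin_of` are assumptions made for illustration.

```python
def sample_position(bin_col, bin_row, offset, bin_w=1.0, bin_h=1.0):
    """Return the (x, y) position of a sample in virtual screen space.

    offset: (ox, oy) pair of integers in 0..255, relative to the bin origin.
    """
    ox, oy = offset
    origin_x = bin_col * bin_w  # lower-left corner of the bin
    origin_y = bin_row * bin_h
    return (origin_x + (ox / 256.0) * bin_w,
            origin_y + (oy / 256.0) * bin_h)


def bin_of(x, y, bin_w=1.0, bin_h=1.0):
    """Return the (column, row) of the bin containing position (x, y)."""
    return (int(x // bin_w), int(y // bin_h))
```

Under this encoding, offset (128, 64) in the bin at column 3, row 2 yields position (3.5, 2.25), and `bin_of` maps that position back to bin (3, 2). Only two bytes per sample are stored rather than full coordinates.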
FIG. 11 — Converting Samples into Pixels

As discussed earlier, the 2-D viewport may be covered
with an array of spatial bins. Each spatial bin may be
populated with samples whose positions are determined by
sample position memory 354. Each spatial bin corresponds
to a memory bin in sample buffer 162. A memory bin stores
the sample values (e.g. red, green, blue, z, alpha, etc.) for the
samples that reside in the corresponding spatial bin.
Sample-to-pixel calculation units 170 (also referred to as
convolve units 170) are configured to read memory bins from
sample buffer 162 and to convert sample values contained
within the memory bins into pixel values.

Parallel Sample-to-Pixel Filtering using Columns —
FIGS. 11A-11B

FIG. 11A illustrates one method for rapidly converting
sample values stored in sample buffer 162 into pixel values.
The spatial bins which cover the 2-D viewport may be
organized into columns (e.g., Cols. 1-4). Each column
comprises a two-dimensional sub-array of spatial bins. The
columns may be configured to horizontally overlap (e.g., by
one or more bins). Each of the sample-to-pixel calculation
units 170-1 through 170-4 may be configured to access
memory bins corresponding to one of the columns. For
example, sample-to-pixel calculation unit 170-1 may be
configured to access memory bins that correspond to the
spatial bins of Column 1. The data pathways between
sample buffer 162 and sample-to-pixel calculation units 170
may be optimized to support this column-wise
correspondence.

The amount of the overlap between columns may depend
upon the horizontal diameter of the filter support for the filter
kernel being used. The example shown in FIG. 11A
illustrates an overlap of two bins. Each square (such as square
188) represents a single bin comprising one or more
samples. Advantageously, this configuration may allow
sample-to-pixel calculation units 170 to work independently
and in parallel, with each of the sample-to-pixel calculation
units 170 receiving and convolving samples residing in the
memory bins of the corresponding column. Overlapping the
columns will prevent visual bands or other artifacts from
appearing at the column boundaries for any operators larger
than a pixel in extent.

Furthermore, the embodiment of FIG. 11A includes a
plurality of bin caches 176 which couple to sample buffer
162. In addition, each of bin caches 176 couples to a
corresponding one of sample-to-pixel calculation units 170.
Generic bin cache 176-I (where I takes any positive
integer value) stores a collection of memory bins corre-

not of equal size or width. For example, column one may
contain significantly fewer bins of samples than column
four. This embodiment may be particularly useful in
configurations of the graphics system that support variable
sample densities. As previously noted and as described in
greater detail below, the graphics system may devote more
samples (i.e., a higher sample density) for areas of sample
buffer 162 that correspond to areas of the final image that
would benefit the most from higher sample densities, e.g.,
areas of particular interest to the viewer or areas of the image
that correspond to the viewer's point of foveation (described
in greater detail below). In these systems that support
variable sample densities, the ability to vary the widths of
the columns may advantageously allow the graphics system
to equalize the number of samples filtered by each of the
sample-to-pixel calculation units. For example, column one
may correspond to a portion of the displayed image upon
which the center of the viewer's view point is focused. Thus,
the graphics system may devote a high density of samples to
the bins in column one, and the graphics system may devote
a lower density of samples to the bins in column four. Thus,
by decreasing the width of column one and increasing the
width of column four, sample-to-pixel calculation units
170-1 and 170-4 may each filter approximately the same
number of samples. Advantageously, balancing the filtering
load among the sample-to-pixel calculation units may allow
the graphics system to use the processing resources of the
sample-to-pixel calculation units in a more efficient manner.

In some embodiments, the graphics system may be
configured to dynamically change the widths of the columns on
a frame by frame basis (or even on a fraction of a frame
basis). In embodiments of the graphics system that change
sample densities dynamically (e.g., eye-tracking, point of
foveation tracking, main character tracking), the sample
densities may vary on a frame by frame basis. Thus, varying
the column width on a frame by frame basis once again
allows the computing resources of sample-to-pixel
calculation units 170 to be utilized in a more efficient manner.
In some embodiments, the column width may be varied on a
scan line basis or some other time-basis. In addition to
varying with time, as the figure illustrates, the columns may
also be configured to overlap (as in the previously described
embodiment) to prevent the appearance of any visual
artifacts (e.g., seams, tears, or vertical lines).

Parallel Sample-to-Pixel Filtering using Rows —
FIG. 12

Turning now to FIG. 12, another embodiment of the
graphics system is shown. In this embodiment, sample
buffer 162 is divided into a plurality of horizontal rows or
sponding to Column I and serves as a cache for sample-to- stripes. As with the previous embodiments, the rows may 
pixel calculation unit 170-1. Generic bin cache 176-1 may overlap and/or vary in width to compensate for varying 
have an optimized coupling to sample buffer 162 which sample densities. As with the previous embodiment, each 
facilitates access to the memory bins for Column I. Since the row may provide bins (and samples) to a particular bin cache 
sample-to-pixel calculation for two adjacent output pixels 55 176 and corresponding sample-to-pixel calculation unit 170. 
may involve many of the same bins, bin caches 176 may Parallel Sample-to-Pixel Filtering using Rows— FIG. 13 
increase the overall access bandwidth to sample buffer 162. Turning now to FIG. 13, yet another embodiment of the 
Samplc-to-bin calculation units 170 may be implemented in graphics system is shown. In this embodiment, sample 
a number of different ways, including using high perfor- buffer 162 is divided into a plurality of rectangular regions, 
mance ALU (arithmetic logic unit) cores, functional units 60 As with the previous embodiments, the rectangular regions 
from a microprocessor or DSP, or a custom design that uses ma y or may not overlap, have different sizes, and/or dynami- 
hardware multipliers and adders. cally vary in size (e.g., on a frame by frame or scan line 
Turning now to FIG. 11B, another method for performing basis). Each region may be configured to provide bins (and 
parallel sample-to-pixel calculation is shown. In this samples) to a particular bin cache 176 and corresponding 
embodiment, sample buffer 162 is divided into a plurality of 65 sample-to-pixel calculation unit 170. In some embodiments, 
vertical columns or stripes as in the previously described each rectangular region may correspond to the image pro- 
embodiment. However, the columns in this embodiment are jected by one of a plurality of projectors (e.g., LCD 
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projectors). In other embodiments, each rectangular region 
may correspond to a particular portion of a single image 
projected or displayed on a single display device. As with 
the previous embodiments, advantageously, the sample-to- 
pixel calculation units 170 may be configured to operate 5 
independently and in parallel, thereby reducing the graphics 
systems' latency. As previously noted, the rectangular 
regions illustrated in FIG. 13 need not be of uniform size 
and/or shape. 

In embodiments of the graphics system that have varying
region sizes or stripe widths, the amount of overlap may also
vary dynamically on a frame by frame or sub-frame basis.
Note, other shapes for the regions into which sample buffer
162 may be divided are possible and contemplated. For
example, in some embodiments each sample-to-pixel
calculation unit may receive bins (and samples) from multiple
small regions or stripes.
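
The load-balancing idea behind unequal stripe widths can be sketched in software. The following is only an illustration under stated assumptions: the function names `balanced_columns` and `with_overlap`, the per-bin-column sample counts, and the padding parameter are hypothetical and do not come from this description, which concerns hardware.

```python
import bisect
from itertools import accumulate


def balanced_columns(samples_per_bin_column, n_units):
    """Split bin columns into n_units contiguous stripes with similar sample totals.

    Returns half-open (start, end) index ranges over the bin columns.
    Assumes there are at least n_units bin columns.
    """
    prefix = list(accumulate(samples_per_bin_column))  # running sample totals
    total = prefix[-1]
    n_cols = len(prefix)
    cuts = [0]
    for k in range(1, n_units):
        target = total * k / n_units
        # First position whose running total reaches the k-th share of the load.
        cut = bisect.bisect_left(prefix, target) + 1
        cut = max(cut, cuts[-1] + 1)            # every unit gets at least one bin column
        cut = min(cut, n_cols - (n_units - k))  # leave room for the remaining units
        cuts.append(cut)
    cuts.append(n_cols)
    return list(zip(cuts[:-1], cuts[1:]))


def with_overlap(stripes, pad_bins, n_cols):
    """Widen each stripe by pad_bins on both sides, clamped to the buffer edges."""
    return [(max(0, s - pad_bins), min(n_cols, e + pad_bins)) for s, e in stripes]


# Hypothetical frame: four dense bin columns (16 samples each) on the left,
# twelve sparse ones (4 samples each) on the right.
density = [16] * 4 + [4] * 12
stripes = balanced_columns(density, 4)
padded = with_overlap(stripes, 1, len(density))
loads = [sum(density[s:e]) for s, e in stripes]
```

In this made-up frame the stripes over the dense area come out narrow and those over the sparse area wide, so each unit filters a comparable number of samples. The padding would in practice follow from the filter support, as noted for FIG. 11A.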

In some embodiments, sample caches 176 may not have
enough storage space to store an entire horizontal scan line.
For this reason dividing the sample buffer into regions may
be useful. Depending on the display device, the regions may
be portions of odd only and even only scan lines. In some
systems, e.g. those with multiple display devices, each
region may correspond to a single display device or to a
quadrant of an image being displayed. For example,
assuming the images formed by four projectors are tiled together
to form a single, large image, then each sample-to-pixel
calculation unit could receive samples corresponding to
pixels displayed by a particular projector. In some
embodiments, the overlapping areas of the regions may be
stored twice, thereby allowing each sample-to-pixel
calculation unit exclusive access to a particular region of the
sample buffer. This may prevent timing problems that result
when two different sample-to-pixel calculation units (or two
sample cache controllers) attempt to access the same set of
memory locations at the same time. In other embodiments
the sample buffer may be multi-ported to allow
multiple concurrent accesses to the same memory locations.

As previously noted, in some embodiments the sample
caches are configured to read samples from the sample
buffer. In some embodiments, the samples may be read on a
bin-by-bin basis from the sample buffer. The sample cache
and/or sample buffer may include control logic that is
configured to ensure that all samples that have a potential to
contribute to one or more pixels that are being filtered (or
that are about to be filtered) are available for the
corresponding sample-to-pixel calculation unit. In some
implementations, the sample caches may be large enough to
store a predetermined array of bins such as 5x5 bins (e.g., to
match the maximum filter size). In another embodiment,
instead of a 5x5 bin cache, the sample caches may be
configured to output pixels as they are being accumulated to
a series of multiple accumulators. In this embodiment, a
different coefficient is generated for each pixel, depending
upon the number of samples and their weightings.

Method for Reading Samples from Sample 
Buffer — FIG. 14 

Turning now to FIG. 14, more details of one embodiment
of a method for reading sample values from a super-sampled
sample buffer are shown. As the figure illustrates, the
sample-to-pixel filter kernel 400 travels across Column I (in
the direction of arrow 406) to generate output pixel values,
where index I takes any value in the range from one to four.
Sample-to-pixel calculation unit 170-I may implement the
sample-to-pixel filter kernel 400. Bin cache 176-I may be used
to provide fast access to the memory bins corresponding to
Column I. For example, bin cache 176-I may have a capacity
greater than or equal to 25 memory bins since the support of
sample-to-pixel filter kernel 400 covers a 5 by 5 array of
spatial bins. As the sample-to-pixel operation proceeds,
memory bins are read from the super-sampled sample buffer
162 and stored in bin cache 176-I. In one embodiment, bins
that are no longer needed, e.g. bins 410, are overwritten in
bin cache 176-I by new bins. As each output pixel is
generated, sample-to-pixel filter kernel 400 shifts. Kernel
400 may be visualized as proceeding in a sequential fashion
within Column I in the direction indicated by arrow 406.
When kernel 400 reaches the right boundary 404 of
Column I, it may shift down one or more rows of bins, and
then proceed horizontally starting from the left column
boundary 402. Thus the sample-to-pixel operation proceeds
in a scan line manner generating successive rows of output
pixels for display.
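
The scan-line traversal and the reuse of cached bins can be illustrated with a small sketch. It assumes, purely for illustration, that the kernel center advances one bin per output pixel and that the cache holds exactly the current 5 by 5 support; the names `support_bins` and `scan_column` are hypothetical.

```python
def support_bins(cx, cy, radius=2):
    """Bin coordinates covered by a (2*radius+1)-square filter support."""
    return {(bx, by)
            for bx in range(cx - radius, cx + radius + 1)
            for by in range(cy - radius, cy + radius + 1)}


def scan_column(left, right, rows, radius=2):
    """Sweep the kernel left to right, row by row, within one column.

    Yields (center, newly_loaded_bins); bins that are no longer needed
    are simply replaced in the cache by the new ones.
    """
    cache = set()
    for cy in rows:
        for cx in range(left, right + 1):
            needed = support_bins(cx, cy, radius)
            loaded = needed - cache   # bins that must be read from the sample buffer
            cache = needed            # stale bins are overwritten
            yield (cx, cy), loaded


steps = list(scan_column(left=2, right=5, rows=[2, 3]))
first_load = len(steps[0][1])    # cold cache: the whole 5x5 support
shift_load = len(steps[1][1])    # one-bin shift to the right
wrap_load = len(steps[4][1])     # return to the left boundary, one row down
```

Under these assumptions a horizontal shift needs only one new column of five bins, which is the reuse that lets the bin caches raise the effective bandwidth to the sample buffer.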

FIG. 15 illustrates potential border conditions in the 
computation of output pixel values. The 2-D viewport 420 is 
illustrated as a rectangle which is overlaid with a rectangular 
array of spatial bins. Recall that every spatial bin corre- 
sponds to a memory bin in sample buffer 162. The memory 
bin stores the sample values and/or sample positions for 
samples residing in the corresponding spatial bin. As 
described above, sample-to-pixel calculation units 170 filter 
samples in the neighborhood of a pixel center in order to 
generate output pixel values (e.g. red, green, blue, etc.). 
Pixel center PC 0 is close enough to the lower boundary 
(Y=0) of the 2-D viewport 420 that its filter support 400 is 
not entirely contained in the 2-D viewport. Sample-to-pixel 
calculation units 170 may generate sample positions and/or 
sample values for the marginal portion of filter support 400 
(i.e. the portion which falls outside the 2-D viewport 420) 
according to a variety of methods. 

In one embodiment, sample-to-pixel calculation units 170 
may generate one or more dummy bins to cover the marginal 
area of the filter support 400. Sample positions for the 
dummy bins may be generated by reflecting the sample 
positions of spatial bins across the 2-D viewport boundary. 
For example, dummy bins F, G, H, I and J may be assigned 
sample positions by reflecting the sample positions corre- 
sponding to spatial bins A, B, C, D, and E respectively, 
across the boundary line Y=0. Predetermined color values 
may be associated with these dummy samples in the dummy 
bins. For example, the value (0,0,0) for the RGB color vector 
may be assigned to each dummy sample. As pixel center PC 0 
moves downward (i.e. toward the boundary Y=0 and 
through it), additional dummy bins with dummy samples 
may be generated to cover filter support 400 (which moves
along with the pixel center PC 0). The number of dummy
samples falling within filter support 400 increases and 
reaches a maximum when filter support 400 has moved 
entirely outside of the 2-D viewport 420. Thus, the color 
value computed based on filter support 400 approaches the 
predetermined background color as the pixel center PC 0 
crosses the boundary. 
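
A minimal sketch of the reflection scheme follows. It is a one-dimensional simplification made for illustration only: a box filter over the vertical extent, a single red channel, real samples of red value 1.0, and the hypothetical helper names `reflect_across_y0` and `filtered_red`.

```python
BACKGROUND_RED = 0.0  # red component of the predetermined dummy color (0,0,0)


def reflect_across_y0(positions):
    """Mirror sample positions across the viewport boundary Y=0."""
    return [(x, -y) for x, y in positions]


def filtered_red(center_y, real_positions, real_red, radius):
    """Box-filter the red channel over real samples plus reflected dummy samples."""
    samples = [(p, real_red) for p in real_positions]
    samples += [(p, BACKGROUND_RED) for p in reflect_across_y0(real_positions)]
    inside = [red for (x, y), red in samples if abs(y - center_y) <= radius]
    return sum(inside) / len(inside)


# One real sample per bin row, at the bin centers just above the boundary.
real = [(0.5, y + 0.5) for y in range(5)]
# Pixel center well inside, on the boundary, and well outside the viewport.
values = [filtered_red(cy, real, 1.0, 2.0) for cy in (2.5, 0.0, -2.5)]
```

As the pixel center crosses Y=0 the share of dummy samples grows, so the filtered value slides from the scene color to the background color.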

A pixel center may lie outside of the 2-D viewport 420, 
and yet, may be close enough to the viewport boundary so 
that part of its filter support lies in the 2-D viewport 420. 
Filter support 401 corresponds to one such pixel center. 
Sample-to-pixel calculation units 170 may generate dummy 
bins Q, R, S, T, U and V to cover the external portion of filter 
support 401 (i.e. the portion external to the 2-D viewport). 
The dummy bins Q, R and S may be assigned sample 
positions based on the sample positions of spatial bins N, O 
and P, and/or spatial bins K, L and M. 

The sample positions for dummy bins may also be
generated by translating the sample positions corresponding to
spatial bins across the viewport boundary, or perhaps, by
generating sample positions on-the-fly according to a
regular, a perturbed regular or stochastic sample positioning
scheme.

FIG. 16 illustrates an alternate embodiment of a method
for performing pixel value computations. Sample-to-pixel
computation units 170 may perform pixel value
computations using a viewable subwindow 422 of the 2-D viewport
420. The viewable subwindow is depicted as a rectangle
with lower left corner at (X1,Y1) and upper right corner at
(X2,Y2) in virtual screen space. Note, in some embodiments
the filter may be auto-normalized or pre-normalized to
reduce the number of calculations required for determining
the final pixel value.

Rendering Samples into a Super-Sampled Sample
Buffer — FIG. 17

FIG. 17 is a flowchart of one embodiment of a method for
drawing or rendering samples into a super-sampled sample
buffer. Certain of the steps of FIG. 17 may occur
concurrently or in different orders. In step 200, graphics system 112
receives graphics commands and graphics data from the host
CPU 102 or directly from system memory 106. In step 202,
the instructions and data are routed to one or more of
rendering units 150A-D. In step 204, rendering units
150A-D determine if the graphics data is compressed. If the
graphics data is compressed, rendering units 150A-D
decompress the graphics data into a useable format, e.g.,
triangles, as shown in step 206. Next, the triangles are
processed, e.g., converted from model space to world space,
lit, and transformed (step 208A).

If the graphics system implements variable resolution
super-sampling, then the triangles are compared with a set of
sample-density region boundaries (step 208B). In
variable-resolution super-sampling, different regions of the 2-D
viewport may be allocated different sample densities based upon
a number of factors (e.g., the center of the attention of an
observer on projection screen SCR as determined by eye or
head tracking). Sample density regions are described in
greater detail below (see section entitled Variable Resolution
Sample Buffer below). If the triangle crosses a
sample-density region boundary (step 210), then the triangle may be
divided into two smaller polygons along the region
boundary (step 212). The polygons may be further subdivided into
triangles if necessary (since the generic slicing of a triangle
gives a triangle and a quadrilateral). Thus, each newly
formed triangle may be assigned a single sample density. In
one embodiment, graphics system 112 may be configured to
render the original triangle twice, i.e. once with each sample
density, and then, to clip the two versions to fit into the two
respective sample density regions.

In step 214, one of the sample positioning schemes (e.g.,
regular, perturbed regular, or stochastic) is selected from
sample position memory 354. The sample positioning
scheme will generally have been pre-programmed into the
sample position memory 354, but may also be selected "on
the fly". In step 216, rendering units 150A-D determine
which spatial bins may contain samples located within the
triangle's boundaries, based upon the selected sample
positioning scheme and the size and shape of the spatial bins. In
step 218, the offsets dX and dY for the samples within these
spatial bins are then read from sample position memory 354.
In step 220, each sample's position is then calculated using
the offsets dX and dY and the coordinates of the
corresponding bin origin, and is compared with the triangle's vertices
to determine if the sample is within the triangle. Step 220 is
discussed in greater detail below.

For each sample that is determined to be within the
triangle, the rendering unit draws the sample by calculating
the sample's color, alpha and other attributes. This may
involve a lighting calculation and an interpolation based
upon the color and texture map information associated with
the vertices of the triangle. Once the sample is rendered, it
may be forwarded to schedule unit 154, which then stores
the sample in sample buffer 162 (step 224).

Note the embodiment of the rendering method described
above is used for explanatory purposes only and is not meant
to be limiting. For example, in some embodiments, the steps
shown in FIG. 17 as occurring serially may be implemented
in parallel. Furthermore, some steps may be reduced or
eliminated in certain embodiments of the graphics system
(e.g., steps 204-206 in embodiments that do not implement
geometry compression, or steps 210-212 in embodiments
that do not implement a variable resolution super-sampled
sample buffer).

Determination of Which Samples are in Polygon
Being Rendered — FIG. 18

The determination of which samples reside within the
polygon being rendered may be performed in a number of
different ways. In one embodiment, the deltas between the
three vertices defining the triangle are first determined. For
example, these deltas may be taken in the order of first to
second vertex (v2-v1)=d12, second to third vertex
(v3-v2)=d23, and third vertex back to the first vertex (v1-v3)=d31.
These deltas form vectors, and each vector may be
categorized as belonging to one of the four quadrants of the
coordinate plane (e.g., by using the two sign bits of its delta
X and Y components). A third condition may be added
determining whether the vector is an X-major vector or
Y-major vector. This may be determined by calculating
whether abs(delta_x) is greater than abs(delta_y). Using
these three bits of information, the vectors may each be
categorized as belonging to one of eight different regions of
the coordinate plane. If three bits are used to define these
regions, then the X-sign bit (shifted left by two), the Y-sign
bit (shifted left by one), and the X-major bit, may be used to
create the eight regions as shown in FIG. 18.

Next, three edge inequalities may be used to define the
interior of the triangle. The edges themselves may be
described as lines in either (or both) of the forms
y=mx+b or x=ry+c, where rm=1. To reduce the numerical
range needed to express the slope, either the X-major or
Y-major equation form for an edge equation may be used (so
that the absolute value of the slope may be in the range of
0 to 1). Thus, the edge (or half-plane) inequalities may be
expressed in either of two corresponding forms:

X-major: y - m·x - b < 0, when point (x,y) is below the edge;

Y-major: x - r·y - c < 0, when point (x,y) is to the left of the edge.

The X-major inequality produces a logical true value (i.e.
sign bit equal to one) when the point in question (x,y) is
below the line defined by an edge. The Y-major equation
produces a logical true value when the point in question
(x,y) is to the left of the line defined by an edge. The side
which comprises the interior of the triangle is known for
each of the linear inequalities, and may be specified by a
Boolean variable referred to herein as the accept bit. Thus,
a sample (x,y) is on the interior side of an edge if

X-major: (y - m·x - b < 0) xor accept = true;

Y-major: (x - r·y - c < 0) xor accept = true.

The accept bit for a given edge may be calculated
according to the following table based on (a) the region (zero
through seven) in which the edge delta vector resides, and
(b) the sense of edge traversal, where clockwise traversal is
indicated by cw=1 and counter-clockwise traversal is
indicated by cw=0. The notation "!" denotes the logical
complement.

1: accept=!cw
0: accept=cw
4: accept=cw
5: accept=cw
7: accept=cw
6: accept=!cw
2: accept=!cw
3: accept=!cw

Tie breaking rules for this representation may also be
implemented (e.g., coordinate axes may be defined as 
belonging to the positive octant). Similarly, X-major may be 
defined as owning all points that tie on the slopes. 

In an alternate embodiment, the accept side of an edge 
may be determined by applying the edge inequality to the 
third vertex of the triangle (i.e. the vertex that is not one of 
the two vertices forming the edge). This method may incur 
the additional cost of a multiply-add, which may be avoided 
by the technique described above. 
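
The region code, accept-bit table and edge inequalities can be combined into a short software sketch. Several points are assumptions made for illustration and are not stated in the text: a coordinate system with the Y axis pointing up, a sign bit of one for negative deltas, ties given to X-major, and a signed-area test used as a stand-in for the faced-ness determination described below. All function names are hypothetical.

```python
def region(dx, dy):
    """Three-bit region code: (X-sign << 2) | (Y-sign << 1) | X-major."""
    x_sign = 1 if dx < 0 else 0
    y_sign = 1 if dy < 0 else 0
    x_major = 1 if abs(dx) >= abs(dy) else 0  # assumption: ties go to X-major
    return (x_sign << 2) | (y_sign << 1) | x_major


ACCEPT_EQUALS_CW = {0, 4, 5, 7}  # per the table; regions 1, 2, 3, 6 use !cw


def accept_bit(reg, cw):
    return cw if reg in ACCEPT_EQUALS_CW else not cw


def signed_area2(v1, v2, v3):
    """Twice the signed area; negative means clockwise when Y points up."""
    return (v2[0] - v1[0]) * (v3[1] - v1[1]) - (v2[1] - v1[1]) * (v3[0] - v1[0])


def on_interior_side(p, q, sample, cw):
    """Apply the X-major or Y-major edge inequality for the edge p -> q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    reg = region(dx, dy)
    x, y = sample
    if reg & 1:                      # X-major: y = m*x + b with |m| <= 1
        m = dy / dx
        b = p[1] - m * p[0]
        inequality = (y - m * x - b) < 0    # true when the point is below the edge
    else:                            # Y-major: x = r*y + c with |r| < 1
        r = dx / dy
        c = p[0] - r * p[1]
        inequality = (x - r * y - c) < 0    # true when the point is left of the edge
    return inequality != accept_bit(reg, cw)    # the xor with the accept bit


def sample_in_triangle(v1, v2, v3, sample):
    area2 = signed_area2(v1, v2, v3)
    if area2 == 0:
        return False                 # degenerate triangle: no interior
    cw = area2 < 0
    edges = ((v1, v2), (v2, v3), (v3, v1))
    return all(on_interior_side(p, q, sample, cw) for p, q in edges)
```

Samples lying exactly on an edge are not given any special treatment here; a real implementation would need the tie breaking rules mentioned above.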

To determine the "faced-ness" of a triangle (i.e., whether 
the triangle is clockwise or counter-clockwise), the delta- 
directions of two edges of the triangle may be checked and 
the slopes of the two edges may be compared. For example, 
assuming that edge12 has a delta-direction of 1 and the
second edge (edge23) has a delta-direction of 0, 4, or 5, then
the triangle is counter-clockwise. If, however, edge23 has a
delta-direction of 3, 2, or 6, then the triangle is clockwise. If
edge23 has a delta-direction of 1 (i.e., the same as edge12),
then comparing the slopes of the two edges breaks the tie
(both are x-major). If edge12 has a greater slope, then the
triangle is clockwise. If edge23 has a delta-direction of 7
(the exact opposite of edge12), then again the slopes are
compared, but with opposite results in terms of whether the
triangle is clockwise or counter-clockwise.
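
The single case worked through above (edge12 in region 1) can be written out as a sketch. The region encoding, the Y-up orientation and the function names are assumptions for illustration; the cross product appears only as an independent check and is not the method described in the text.

```python
def region(dx, dy):
    """Three-bit delta-direction: (X-sign << 2) | (Y-sign << 1) | X-major."""
    x_sign = 1 if dx < 0 else 0
    y_sign = 1 if dy < 0 else 0
    x_major = 1 if abs(dx) >= abs(dy) else 0  # assumption: ties go to X-major
    return (x_sign << 2) | (y_sign << 1) | x_major


def facedness_for_region1(d12, d23):
    """Faced-ness when edge12 has delta-direction 1, following the text's example."""
    if region(*d12) != 1:
        raise ValueError("this sketch only covers edge12 in region 1")
    r23 = region(*d23)
    if r23 in (0, 4, 5):
        return "ccw"
    if r23 in (3, 2, 6):
        return "cw"
    # Regions 1 and 7 are both X-major, so the slopes dy/dx are well defined.
    m12 = d12[1] / d12[0]
    m23 = d23[1] / d23[0]
    if m12 == m23:
        return "degenerate"          # no interior area
    if r23 == 1:
        return "cw" if m12 > m23 else "ccw"
    return "ccw" if m12 > m23 else "cw"   # region 7: opposite result


def cross_check(d12, d23):
    """Reference answer from the sign of the cross product (Y axis up)."""
    cross = d12[0] * d23[1] - d12[1] * d23[0]
    return "ccw" if cross > 0 else "cw" if cross < 0 else "degenerate"
```

Only the one slope comparison is needed beyond the region codes, which matches the cost argument made for this approach.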

The same analysis can be exhaustively applied to all 
combinations of edge12 and edge23 delta-directions, in
every case determining the proper faced-ness. If the slopes 
are the same in the tie case, then the triangle is degenerate 
(i.e., with no interior area). It can be explicitly tested for and 
culled, or, with proper numerical care, it could be let through 
as it will cause no samples to render. One special case arises 
when a triangle splits the view plane. However, this case 
may be detected earlier in the pipeline (e.g., when front 
plane and back plane clipping are performed). 

Note in most cases only one side of a triangle is rendered. 
Thus, if the faced-ness of a triangle determined by the 
analysis above is the one to be rejected, then the triangle can 
be culled (i.e., subject to no further processing with no 
samples generated). Further note that this determination of 
faced-ness only uses one additional comparison (i.e., of the 
slope of edge12 to that of edge23) beyond factors already
computed. Many traditional approaches may utilize more 
complex computations (though at earlier stages of the set-up 
computation). 

Generating Output Pixel Values from Sample
Values — FIG. 19

FIG. 19 is a flowchart of one embodiment of a method for
selecting and filtering samples stored in super-sampled
sample buffer 162 to generate output pixel values. In step
250, a stream of memory bins is read from the
super-sampled sample buffer 162. In step 252, these memory bins
may be stored in one or more of bin caches 176 to allow the
sample-to-pixel calculation units 170 easy access to sample
values during the sample-to-pixel operation. In step 254, the
memory bins are examined to determine which of the
memory bins may contain samples that contribute to the
output pixel value currently being generated. Each sample
that is in a bin that may contribute to the output pixel is then
individually examined to determine if the sample does
indeed contribute (steps 256-258). This determination may
be based upon the distance from the sample to the center of
the output pixel being generated.

In one embodiment, the sample-to-pixel calculation units
170 may be configured to calculate this distance (i.e., the
extent or envelope of the filter at the sample's position) and
then use it to index into a table storing filter weight values
according to filter extent (step 260). In another embodiment,
however, the potentially expensive calculation for
determining the distance from the center of the pixel to the sample
(which typically involves a square root function) is avoided
by using distance squared to index into the table of filter
weights. Alternatively, a function of x and y may be used in
lieu of one dependent upon a distance calculation. In one
embodiment, this may be accomplished by utilizing a
floating point format for the distance (e.g., four or five bits of
mantissa and three bits of exponent), thereby allowing much
of the accuracy to be maintained while compensating for the
increased range in values. In one embodiment, the table may
be implemented in ROM. However, RAM tables may also
be used. Advantageously, RAM tables may, in some
embodiments, allow the graphics system to vary the filter
coefficients on a per-frame basis. For example, the filter
coefficients may be varied to compensate for known
shortcomings of the display or for the user's personal preferences.
In some embodiments, the use of RAM tables may allow the
user to select different filters (e.g., via a sharpness control on
the display device or in a window system control panel). A
number of different filters may be implemented to generate
desired levels of sharpness based on different display types.
For example, the control panel may have one setting
optimized for LCD displays and another setting optimized for
CRT displays. The graphics system can also vary the filter
coefficients on a screen area basis within a frame, or on a
per-output pixel basis. Another alternative embodiment may
actually calculate the desired filter weights for each sample
using specialized hardware (e.g., multipliers and adders).
Samples outside the limits of the
sample-to-pixel filter may simply be multiplied by a filter
weight of zero (step 262), or they may be removed from the
calculation entirely.

Once the filter weight for a sample has been determined,
the sample may then be multiplied by its filter weight (step
264). The weighted sample may then be summed with a
running total to determine the final output pixel's color value
(step 266). The filter weight may also be added to a running
total pixel filter weight (step 268), which is used to
normalize the filtered pixels. Normalization advantageously
prevents the filtered pixels (e.g., pixels with more samples than
other pixels) from appearing too bright or too dark by
compensating for gain introduced by the sample-to-pixel
calculation process. After all the contributing samples have
been weighted and summed, the total pixel filter weight may
be used to divide out the gain caused by the filtering (step
270). Finally, the normalized output pixel may be output
and/or processed through one or more of the following
processes (not necessarily in this order): gamma correction,
color look-up using pseudo color tables, direct color, inverse
gamma correction, programmable gamma encoding, color
space conversion, and digital-to-analog conversion, before
eventually being displayed (step 274).
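
The weighting and normalization steps (260 through 270) can be sketched as follows, using the distance-squared table lookup mentioned above. This is a software illustration only: the table size, the tent-shaped filter profile and the names `build_weight_table`, `filter_pixel` and `tent` are hypothetical choices, not details of the described hardware.

```python
import math


def build_weight_table(filter_fn, max_radius, entries=256):
    """Tabulate filter weights indexed by squared distance."""
    step = (max_radius * max_radius) / entries
    # The square root is taken only here, once per table entry,
    # never per sample at filtering time.
    table = [filter_fn(math.sqrt((i + 0.5) * step)) for i in range(entries)]
    return table, step


def filter_pixel(center, samples, table, step):
    """Weighted, normalized sum of (x, y, (r, g, b)) samples around a pixel center."""
    cx, cy = center
    acc = [0.0, 0.0, 0.0]
    total_weight = 0.0
    for x, y, rgb in samples:
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        index = int(d2 / step)
        if index >= len(table):
            continue                  # outside the filter: same as a weight of zero
        weight = table[index]
        for i in range(3):
            acc[i] += weight * rgb[i]     # multiply and accumulate
        total_weight += weight            # running total filter weight
    if total_weight == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / total_weight for c in acc)   # divide out the gain


def tent(distance, radius=2.0):
    """Hypothetical cone-shaped profile; any radially symmetric filter would do."""
    return max(0.0, 1.0 - distance / radius)


table, step = build_weight_table(tent, 2.0)
samples = [(0.2, 0.1, (0.5, 0.25, 1.0)),
           (1.0, 1.0, (0.5, 0.25, 1.0)),
           (5.0, 5.0, (9.0, 9.0, 9.0))]   # far outside the filter extent
pixel = filter_pixel((0.0, 0.0), samples, table, step)
```

Because both contributing samples carry the same color, the normalized result reproduces that color regardless of their individual weights, which is the purpose of dividing by the total filter weight.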

In some embodiments, the graphics system may be con- 
figured to use each sample's alpha information to generate 
a mask that is output with the sample. The mask may be used
to perform real-time soft-edged blue screen effects. For 
example, the mask may be used to indicate which portions 
of the rendered image should be masked (and how much). 
This mask could be used by the graphics system or external 
hardware to blend the rendered image with another image 
(e.g., a signal from a video camera) to create a blue screen 
effect that is smooth (anti-aliased with respect to the over- 
lapping regions of the two images) or a ghost effect (e.g., 
superimposing a partially transparent object smoothly over 
another object, scene, or video stream). 

Example Output Pixel Calculation — FIG. 20 

FIG. 20 illustrates a simplified example of an output pixel 
convolution. As the figure shows, four bins 288A-D contain
samples that may possibly contribute to the output pixel. In 
this example, the center of the output pixel is located at the 
boundary of bins 288A-288D. Each bin comprises sixteen 
samples, and an array of four bins (2x2) is filtered to 
generate the output pixel. Assuming circular filters are used, 
the distance of each sample from the pixel center determines 
which filter value will be applied to the sample. For 
example, sample 296 is relatively close to the pixel center, 
and thus falls within the region of the filter having a filter 
value of 8. Similarly, samples 294 and 292 fall within the 
regions of the filter having filter values of 4 and 2, respec- 
tively. Sample 290, however, falls outside the maximum 
filter extent, and thus receives a filter value of 0. Thus 
sample 290 will not contribute to the output pixel's value. 
This type of filter ensures that the samples located the closest
to the pixel center will contribute the most, while samples
located farther from the pixel center will contribute less to
the final output pixel values. This type of filtering
automatically performs anti-aliasing by smoothing any abrupt
changes in the image (e.g., from a dark line to a light
background). Another particularly useful type of filter for
anti-aliasing is a windowed sinc filter. Advantageously, the
windowed sinc filter contains negative lobes that resharpen
some of the blended or "fuzzed" image. Negative lobes are
areas where the filter causes the samples to subtract from the
pixel being calculated. In contrast, samples on either side of
the negative lobe add to the pixel being calculated.

Example values for samples 290-296 are illustrated in 
boxes 300-308. In this example, each sample comprises red, 
green, blue and alpha values, in addition to the sample's 
positional data. Block 310 illustrates the calculation of each 
pixel component value for the non-normalized output pixel.
As block 310 indicates, potentially undesirable gain is
introduced into the final pixel values (i.e., an output pixel
having a red component value of 2000 is much higher than
any of the samples' red component values). As previously
noted, the filter values may be summed to obtain normal- 
ization value 308. Normalization value 308 is used to divide 
out the unwanted gain from the output pixel. Block 312 
illustrates this process and the final normalized example 
pixel values. 
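
The gain and its removal can be reproduced with a few lines of arithmetic. The filter values 8, 4, 2 and 0 are those named above for samples 296, 294, 292 and 290; the red component values below are hypothetical, since the contents of the figure's boxes are not given in the text, and the sketch assumes only these four samples contribute.

```python
filter_values = [8, 4, 2, 0]         # samples 296, 294, 292, 290 per the text
red_values = [200, 150, 100, 250]    # hypothetical sample reds, not taken from FIG. 20

# Non-normalized sum: larger than any individual sample's red value.
raw_red = sum(w * r for w, r in zip(filter_values, red_values))

# Normalization value: the sum of the filter values that were applied.
normalization = sum(filter_values)

# Dividing out the gain brings the result back into the range of the samples.
normalized_red = raw_red / normalization
```

With these made-up inputs the raw sum is 2400 while the normalized value is about 171, between the red values of the contributing samples.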

Note the values used herein were chosen for descriptive
purposes only and are not meant to be limiting. For example,
the filter may have a large number of regions each with a
different filter value. In one embodiment, some regions may
have negative filter values. The filter utilized may be a
continuous function that is evaluated for each sample based
on the sample's distance from the pixel center. Also note that
floating point values may be used for increased precision. A
variety of filters may be utilized, e.g., box, tent, cylinder,
cone, Gaussian, Catmull-Rom, Mitchell and Netravali,
windowed sinc, etc.

It is also noted that the filter weights need not be powers
of two as shown in the figure. The example in the figure is
simplified for explanatory purposes. A table of filter weights
may be used (e.g., having a large number of entries which
are indexed into based on the distance of the sample from
the pixel or filter center). Furthermore, in some
embodiments each sample in each bin may be summed to form the
pixel value (although some samples within the bins may
have a weighting of zero and thus nevertheless contribute
nothing to the final pixel value).

Full-Screen Anti-aliasing

The vast majority of current 3D graphics systems only
provide real-time anti-aliasing for lines and dots. While
some systems also allow the edge of a polygon to be
"fuzzed", this technique typically works best when all
polygons have been pre-sorted in depth. This may defeat the
purpose of having general-purpose 3D rendering hardware
for most applications (which do not depth pre-sort their
polygons). In one embodiment, graphics system 112 may be
configured to implement full-screen anti-aliasing by
stochastically sampling up to sixteen samples per output pixel,
filtered by a 5x5-convolution filter.

Variable-Resolution Super Sampling — FIGS. 21-25 

Turning now to FIG. 21, a diagram of one possible
scheme for dividing sample buffer 162 is shown. In this
embodiment, sample buffer 162 is divided into the following
three nested regions: foveal region 354, medial region 352,
and peripheral region 350. Each of these regions has a
rectangular shaped outer border, but the medial and the
peripheral regions have a rectangular shaped hole in their
center. Each region may be configured with certain constant
(per frame) properties, e.g., a constant sample density
and a constant size of pixel bin. In one embodiment, the
total density range may be 256, i.e., a region could support
between one sample every 16 screen pixels (4x4) and 16
samples for every 1 screen pixel. In other embodiments, the
total density range may be limited to other values, e.g., 64.
In one embodiment, the sample density varies, either
linearly or non-linearly, across a respective region. Note in
other embodiments the display may be divided into a
plurality of constant sized regions (e.g., squares that are 4x4
pixels in size or 40x40 pixels in size).

To simplify calculations for polygons that encompass
one or more region corners (e.g., a foveal region
corner), the sample buffer may be further divided into a
plurality of sub-regions. In FIG. 21, one embodiment of
sample buffer 162 divided into sub-regions is shown. Each
of these sub-regions is rectangular, allowing graphics
system 112 to translate from a 2D address within a sub-region to
a linear address in sample buffer 162. Thus, in some
embodiments each sub-region has a memory base address,
indicating where storage for the pixels within the sub-region starts.
Each sub-region may also have a "stride" parameter
associated with its width.
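
The translation from a 2D address within a sub-region to a linear address can be sketched as below. The row-major layout, the field names, and the numeric values are assumptions chosen for illustration; the description above states only that each sub-region has a base address and a stride associated with its width.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubRegion:
    base: int    # linear address where this sub-region's storage starts
    stride: int  # entries per row (assumed equal to the sub-region width)
    height: int  # number of rows in the sub-region

def linear_address(sub: SubRegion, x: int, y: int) -> int:
    """Map a 2D address (x, y) inside `sub` to a linear buffer address."""
    if not (0 <= x < sub.stride and 0 <= y < sub.height):
        raise IndexError("address lies outside the sub-region")
    return sub.base + y * sub.stride + x
```

Because every sub-region is rectangular, this mapping needs only one multiply and two adds per address.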

Another potential division of the super-sampled sample
buffer is circular. Turning now to FIG. 22, one such
embodiment is illustrated. For example, each region may have two
radii associated with it (i.e., 360-368), dividing the region
into three concentric circular-regions. The circular-regions
may all be centered at the same screen point, the fovea center
point. Note however, that the fovea center-point need not
always be located at the center of the foveal region. In some
instances it may even be located off-screen (i.e., to the side
of the visual display surface of the display device). While the
embodiment illustrated supports up to seven distinct
circular-regions, it is possible for some of the circles to be
shared across two different regions, thereby reducing the
distinct circular-regions to five or less.

The circular regions may delineate areas of constant
sample density actually used. For example, in the example
illustrated in the figure, foveal region 354 may allocate a
sample buffer density of 8 samples per screen pixel, but
outside the innermost circle 368, it may only use 4 samples
per pixel, and outside the next circle 366 it may only use two
samples per pixel. Thus, in this embodiment the rings need
not necessarily save actual memory (the regions do that), but
they may potentially save memory bandwidth into and out of
the sample buffer (as well as pixel convolution bandwidth).
In addition to indicating a different effective sample density,
the rings may also be used to indicate a different sample
position scheme to be employed. As previously noted, these
sample position schemes may be stored in an on-chip RAM/
ROM, or in programmable memory.

As previously discussed, in some embodiments
super-sampled sample buffer 162 may be further divided into bins.
For example, a bin may store a single sample or an array of
samples (e.g., 2x2 or 4x4 samples). In one embodiment,
each bin may store between one and sixteen sample points,
although other configurations are possible and contemplated.
Each region may be configured with a particular bin
size, and a constant memory sample density as well. Note
that the lower density regions need not necessarily have
larger bin sizes. In one embodiment, the regions (or at least
the inner regions) are exact integer multiples of the bin size
enclosing the region. This may allow for more efficient
utilization of the sample buffer in some embodiments.

Variable-resolution super-sampling involves calculating a
variable number of samples for each pixel displayed on the
display device. Certain areas of an image may benefit from
a greater number of samples (e.g., near object edges), while
other areas may not need extra samples (e.g., smooth areas
having a constant color and brightness). To save memory
and bandwidth, extra samples may be used only in areas that
may benefit from the increased resolution. For example, if
part of the display is colored a constant color of blue (e.g.,
as in a background), then extra samples may not be
particularly useful because they will all simply have the constant
value (equal to the background color being displayed). In
contrast, if a second area on the screen is displaying a 3D
rendered object with complex textures and edges, the use of
additional samples may be useful in avoiding certain
artifacts such as aliasing. A number of different methods may be
used to determine or predict which areas of an image would
benefit from higher sample densities. For example, an edge
analysis could be performed on the final image, with that
information being used to predict how the sample densities
should be distributed. The software application may also be
able to indicate which areas of a frame should be allocated
higher sample densities.

A number of different methods may be used to implement
variable-resolution super-sampling. These methods tend to
fall into the following two general categories:
(1) those methods that concern the draw or rendering
process, and (2) those methods that concern the
convolution process. For example, samples may be
rendered into the super-sampling sample buffer 162 using
any of the following methods:

1) a uniform sample density;

2) varying sample density on a per-region basis (e.g.,
medial, foveal, and peripheral); and

3) varying sample density by changing density on a
scan-line basis (or on a small number of scan lines
basis).

Varying sample density on a scan-line basis may be
accomplished by using a look-up table of densities. For
example, the table may specify that the first five pixels of a
particular scan line have three samples each, while the next
four pixels have two samples each, and so on.

On the convolution side, the following methods are
possible:

1) a uniform convolution filter;

2) continuously variable convolution filter; and

3) a convolution filter operating at multiple spatial
frequencies.

A uniform convolve filter may, for example, have a
constant extent (or number of samples selected) for each
pixel calculated. In contrast, a continuously variable
convolution filter may gradually change the number of samples
used to calculate a pixel. The function may be varied
continuously from a maximum at the center of attention to a
minimum in peripheral areas.

Different combinations of these methods (both on the
rendering side and convolution side) are also possible. For
example, a constant sample density may be used on the
rendering side, while a continuously variable convolution
filter may be used on the samples.

Different methods for determining which areas of the
image will be allocated more samples per pixel are also
contemplated. In one embodiment, if the image on the
screen has a main focal point (e.g., a character like Mario in
a computer game), then more samples may be calculated for
the area around Mario and fewer samples may be calculated
for pixels in other areas (e.g., around the background or near
the edges of the screen).

In another embodiment, the viewer's point of foveation
may be determined by eye/head/hand-tracking. In
head-tracking embodiments, the direction of the viewer's gaze is
determined or estimated from the orientation of the viewer's
head, which may be measured using a variety of
mechanisms. For example, a helmet or visor worn by the viewer
(with eye/head tracking) may be used alone or in
combination with a hand-tracking mechanism, wand, or eye-tracking
sensor to provide orientation information to graphics system
112. Other alternatives include head-tracking using an
infrared reflective dot placed on the user's forehead, or using a
pair of glasses with head- and/or eye-tracking sensors built
in. One method for using head- and hand-tracking is
disclosed in U.S. Pat. No. 5,446,834 (entitled "Method and
Apparatus for High Resolution Virtual Reality Systems
Using Head Tracked Display," by Michael Deering, issued
Aug. 29, 1995), which is incorporated herein by reference in
its entirety. Other methods for head tracking are also
possible and contemplated (e.g., infrared sensors,
electromagnetic sensors, capacitive sensors, video cameras, sonic and
ultrasonic detectors, clothing based sensors, video tracking
devices, conductive ink, strain gauges, force-feedback
detectors, fiber optic sensors, pneumatic sensors, magnetic
tracking devices, and mechanical switches).

As previously noted, eye-tracking may be particularly
advantageous when used in conjunction with head-tracking.
In eye-tracked embodiments, the direction of the viewer's
gaze is measured directly by detecting the orientation of the
viewer's eyes in relation to the viewer's head. This
information, when combined with other information
regarding the position and orientation of the viewer's head in
relation to the display device, may allow an accurate
measurement of the viewer's point of foveation (or points of
foveation if two eye-tracking sensors are used). One possible
method for eye tracking is disclosed in U.S. Pat. No.
5,638,176 (entitled "Inexpensive Interferometric Eye
Tracking System"). Other methods for eye tracking are also
possible and contemplated (e.g., the methods for head
tracking listed above).
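
Returning to the look-up table of densities described above for varying sample density on a scan-line basis, one possible encoding is a list of runs. The run-length layout and the name `samples_for_pixel` are assumptions made for this sketch; the description specifies only that a table gives the number of samples for groups of pixels on a scan line.

```python
def samples_for_pixel(density_runs, x, default=1):
    """Look up the sample count for pixel column x on one scan line.

    density_runs: list of (pixel_count, samples_per_pixel) runs,
                  ordered left to right along the scan line.
    default: sample count used past the end of the table (an assumption).
    """
    start = 0
    for count, samples in density_runs:
        if start <= x < start + count:
            return samples
        start += count
    return default

# The example from the text: five pixels with three samples each,
# followed by four pixels with two samples each.
EXAMPLE_RUNS = [(5, 3), (4, 2)]
```

A run-length form keeps the table small when density changes only a few times per scan line.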

Regardless of which method is used, as the viewer's point 
of foveation changes position, so does the distribution of 
samples. For example, if the viewer's gaze is focused on the 
upper left-hand corner of the screen, the pixels correspond- 
ing to the upper left-hand corner of the screen may each be 
allocated eight or sixteen samples, while the pixels in the 
opposite corner (i.e., the lower right-hand corner of the 
screen) may be allocated only one or two samples per pixel. 
Once the viewer's gaze changes, so does the allotment of 
samples per pixel. When the viewer's gaze moves to the 
lower right-hand corner of the screen, the pixels in the upper 
left-hand corner of the screen may be allocated only one or 
two samples per pixel. Thus the number of samples per pixel 
may be actively changed for different regions of the screen 
in relation to the viewer's point of foveation. Note in some
embodiments, multiple users may each have head/eye/
hand tracking mechanisms that provide input to graphics
system 112. In these embodiments, there may conceivably 
be two or more points of foveation on the screen, with 
corresponding areas of high and low sample densities. As 
previously noted, these sample densities may affect the 
render process only, the filter process only, or both pro- 
cesses. 
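
Where the sample density affects the filter process, a continuously variable filter could, for instance, reduce the number of samples it uses as a pixel moves away from the center of attention. The sketch below assumes a linear falloff and hypothetical limits of 16 and 1 samples; the description requires only that the variation be continuous from a maximum at the center of attention to a minimum in peripheral areas.

```python
import math

def filter_sample_count(px, py, center, max_dist,
                        max_samples=16, min_samples=1):
    """Number of samples the filter uses for the pixel at (px, py).

    center: (x, y) center of attention, e.g. a point of foveation.
    max_dist: distance at which the minimum count is reached (assumed).
    """
    d = min(math.hypot(px - center[0], py - center[1]), max_dist)
    t = d / max_dist  # 0.0 at the center, 1.0 at or beyond max_dist
    return round(max_samples + t * (min_samples - max_samples))
```

With several points of foveation, one simple option would be to evaluate this function for each point and keep the largest result.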

Turning now to FIGS. 24A-B, one embodiment of a 
method for apportioning the number of samples per pixel is 
shown. The method apportions the number of samples based 
on the location of the pixel relative to one or more points of 
foveation. In FIG. 24A, an eye- or head-tracking device 360 
is used to determine the point of foveation 362 (i.e., the focal 
point of a viewer's gaze). This may be determined by using 
tracking device 360 to determine the direction that the 
viewer's eyes (represented as 364 in the figure) are facing. 
As the figure illustrates, in this embodiment, the pixels are 
divided into foveal region 354 (which may be centered 
around the point of foveation 362), medial region 352, and 
peripheral region 350. 

Three sample pixels are indicated in the figure. Sample 
pixel 374 is located within foveal region 354. Assuming
foveal region 354 is configured with bins having eight
samples, and assuming the convolution radius for each pixel 
touches four bins, then a maximum of 32 samples may 
contribute to each pixel. Sample pixel 372 is located within 
medial region 352. Assuming medial region 352 is config- 
ured with bins having four samples, and assuming the 
convolution radius for each pixel touches four bins, then a 
maximum of 16 samples may contribute to each pixel. 
Sample pixel 370 is located within peripheral region 350. 
Assuming peripheral region 350 is configured with bins
having one sample each, and assuming the convolution 
radius for each pixel touches one bin, then there is a one 
sample to pixel correlation for pixels in peripheral region 
350. Note these values are merely examples and a different 
number of regions, samples per bin, and convolution radius 
may be used. 
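
The arithmetic in the example above is simply samples per bin multiplied by the number of bins touched by the convolution radius. The short sketch below reproduces it using the example values from the text; the dictionary layout and function name are hypothetical.

```python
# Example configuration from the text (FIG. 24A).
REGIONS = {
    "foveal": {"samples_per_bin": 8, "bins_touched": 4},
    "medial": {"samples_per_bin": 4, "bins_touched": 4},
    "peripheral": {"samples_per_bin": 1, "bins_touched": 1},
}

def max_contributing_samples(region):
    """Upper bound on the samples that may contribute to one pixel."""
    cfg = REGIONS[region]
    return cfg["samples_per_bin"] * cfg["bins_touched"]
```

This is an upper bound because samples in a touched bin may still fall outside the convolution radius or carry a weight of zero.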

Turning now to FIG. 24B, the same example is shown, but
with a different point of foveation 362. As the figure
illustrates, when tracking device 360 detects a change in the
position of point of foveation 362, it provides input to the
graphics system, which then adjusts the position of foveal
region 354 and medial region 352. In some embodiments,
parts of some of the regions (e.g., medial region 352) may
extend beyond the edge of display device 84. In this
example, pixel 370 is now within foveal region 354, while
pixels 372 and 374 are now within the peripheral region.
Assuming the same sample configuration as in the example of
FIG. 24A, a maximum of 32 samples may contribute to pixel 370,
while only one sample will contribute to pixels 372 and 374.
Advantageously, this configuration may allocate more
samples for regions that are near the point of foveation (i.e.,
the focal point of the viewer's gaze). This may provide a
more realistic image to the viewer without the need to
calculate a large number of samples for every pixel on
display device 84.
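
The effect of moving the point of foveation can be sketched with nested rectangular regions that are re-centered on each update. The region half-sizes and pixel coordinates below are hypothetical; only the three region names come from the description.

```python
def classify_pixel(px, py, fovea, foveal_half, medial_half):
    """Return the region that contains the pixel at (px, py).

    fovea: (x, y) current point of foveation.
    foveal_half, medial_half: (half_width, half_height) of the nested
        rectangular regions centered on the point of foveation.
    """
    dx = abs(px - fovea[0])
    dy = abs(py - fovea[1])
    if dx <= foveal_half[0] and dy <= foveal_half[1]:
        return "foveal"
    if dx <= medial_half[0] and dy <= medial_half[1]:
        return "medial"
    return "peripheral"
```

Because the regions are defined relative to the point of foveation, a region that extends past the edge of the display needs no special handling here; pixels are only ever queried at on-screen coordinates.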

Turning now to FIGS. 25A-B, another embodiment of a
computer system configured with a variable resolution
super-sampled sample buffer is shown. In this embodiment,
the center of the viewer's attention is determined by the position
of a main character 362. Medial and foveal regions are
centered around main character 362 as it moves around the
screen. In some embodiments the main character may be a
simple cursor (e.g., as moved by keyboard input or by a
mouse).

In still another embodiment, regions with higher sample
density may be centered around the middle of display device
84's screen. Advantageously, this may require less control
software and hardware while still providing a sharper image
in the center of the screen (where the viewer's attention may
be focused the majority of the time).

Although the embodiments above have been described in
considerable detail, other versions are possible. Numerous
variations and modifications will become apparent to those
skilled in the art once the above disclosure is fully
appreciated. It is intended that the following claims be interpreted
to embrace all such variations and modifications. Note the
headings used herein are for organizational purposes only
and are not meant to limit the description provided herein or
the claims attached hereto.
What is claimed is: 

1. A graphics system comprising: 

one or more processors configured to receive a set of
three-dimensional graphics data and render a plurality
of samples based on the graphics data;

a sample buffer configured to store the plurality of
samples; and

a plurality of sample-to-pixel calculation units, wherein the
sample-to-pixel calculation units are configured to
receive and filter samples from the sample buffer to
create output pixels, wherein the output pixels are
usable to form an image on a display device, wherein
each of the plurality of sample-to-pixel calculation
units are configured to generate pixels corresponding to
a different one of a plurality of regions of the image.

2. The graphics system as recited in claim 1, wherein the
processors are configured to receive the three-dimensional
graphics data in a compressed form, and wherein the
processors are configured to decompress the three-dimensional
graphics data before rendering the samples.

3. The graphics system as recited in claim 1, wherein each 
region corresponds to a different vertical stripe of the image. 

4. The graphics system as recited in claim 1, wherein each
region comprises portions of the image that correspond to
one or more odd or even scan lines.

5. The graphics system as recited in claim 1, wherein each
region comprises a different quadrant of the image.

6. The graphics system as recited in claim 1, wherein the
plurality of regions overlap.

7. The graphics system as recited in claim 1, wherein the 
display device comprises a plurality of individual display 
devices, and wherein each region corresponds to a single one 
of the plurality of individual display devices. 

8. The graphics system as recited in claim 1, wherein the 
display device comprises a plurality of individual display 
devices, and wherein each region corresponds to a different 
one of the plurality of individual display devices. 

9. The graphics system as recited in claim 1, wherein the
regions vary in dimension on a frame-by-frame basis.

10. The graphics system as recited in claim 1, wherein the
regions of the image vary in dimension on a frame-by-frame
basis to balance the number of samples filtered by each of
the sample-to-pixel calculation units.

11. The graphics system as recited in claim 1, wherein 
each region is a different horizontal stripe of the image. 

12. The graphics system as recited in claim 1, wherein 
each region is a different rectangular portion of the image. 

13. The graphics system as recited in claim 1, wherein 
each sample comprises color components, and wherein the 
sample-to-pixel calculation units are configured to: 

determine which samples are within a predetermined filter 
envelope; 

multiply the samples within the predetermined filter enve- 
lope by one or more un-normalized weighting factors, 
wherein the weighting factors vary in relation to the 
sample's position relative to the center of the filter 
envelope; and 

normalize the resulting output pixels. 

14. The graphics system as recited in claim 1, wherein 
each sample comprises color components, and wherein the 
sample-to-pixel calculation units are configured to deter- 
mine which samples are within a predetermined filter 
envelope, and multiply the samples within the predeter- 
mined filter envelope by one or more normalized weighting 
factors, wherein the weighting factors vary in relation to the 
sample's position relative to the center of the filter envelope.

15. The graphics system as recited in claim 1, wherein 
each sample comprises an alpha component. 

16. The graphics system as recited in claim 1, wherein 
each sample comprises a blur component. 

17. The graphics system as recited in claim 1, wherein 
each sample comprises a transparency component. 

18. The graphics system as recited in claim 1, wherein 
each sample comprises a z-component. 

19. The graphics system as recited in claim 1, wherein the 
samples stored in the sample buffer are double buffered. 

20. The graphics system as recited in claim 1, wherein the 
samples stored in the sample buffer are stored in bins. 

21. The graphics system as recited in claim 1, further 
comprising the display device. 

22. A method for rendering a set of three-dimensional 
graphics data, the method comprising: 

receiving the three-dimensional graphics data; 
generating one or more samples based on the graphics 
data; 

storing the samples; 

dividing the samples into a plurality of regions; 
selecting stored samples from the plurality of regions; and 
filtering the selected samples to form a plurality of output 

pixels in parallel, wherein the output pixels are usable 

to form an image on a display device. 

23. The method as recited in claim 22, wherein the
plurality of regions comprise portions of one or more odd or
even scan lines of the image.

24. The method as recited in claim 22, wherein each
region comprises a quadrant of the image.

25. The method as recited in claim 22, wherein each
region comprises a vertical stripe of the image.

26. The method as recited in claim 22, wherein each
region corresponds to a horizontal stripe of the image.

27. The method as recited in claim 22, wherein the display
device comprises a plurality of individual display devices,
and wherein each region corresponds to a single one of the
plurality of individual display devices.

28. The method as recited in claim 22, wherein the display
device comprises a plurality of individual display devices,
and wherein each region corresponds to a different one of the
plurality of individual display devices.

29. The method as recited in claim 22, wherein each
region comprises a vertical stripe of the image.

30. The method as recited in claim 22, wherein the regions
of the image overlap.

31. The method as recited in claim 22, wherein the
boundaries of the regions change over time to balance the
number of samples in each vertical column.

32. The method as recited in claim 22, wherein the
number of samples filtered per pixel varies across the image.

33. The method as recited in claim 22, wherein the
boundaries of the regions change over time to equalize the
number of samples in each region.

34. The method as recited in claim 22, wherein the
three-dimensional graphics data is received in compressed
form, and wherein the method further comprises:
decompressing the compressed three-dimensional graphics data.

35. The method as recited in claim 22, wherein each
sample comprises color components, and wherein said
filtering comprises:

determining which samples are within a predetermined
filter envelope;

multiplying the samples within the predetermined filter
envelope by one or more un-normalized weighting
factors, wherein said weighting factors vary in relation
to the sample's position relative to the center of the
filter envelope;

summing the weighted samples to form an output pixel;
and

normalizing the output pixel.

36. The method as recited in claim 22, wherein each
sample comprises color components, and wherein said
filtering comprises:

determining which samples are within a predetermined
filter envelope;

multiplying the samples within the predetermined filter
envelope by one or more normalized weighting factors,
wherein the weighting factors vary in relation to the
sample's position relative to the center of the filter
envelope; and

summing the weighted samples to form an output pixel.

37. The method as recited in claim 22, wherein the 
samples are stored in bins. 

38. A computer system comprising: 

a means for receiving a set of three-dimensional graphics
data;

a means for rendering a plurality of samples based on the
set of three-dimensional graphics data;

a means for storing the rendered samples; and

a plurality of filtering means configured to filter stored
samples to create output pixels, wherein the output
pixels are usable to form an image on a display device,
and wherein each of the plurality of filtering means are
configured to generate pixels corresponding to one of a
plurality of different regions of the image.

39. The computer system as recited in claim 38, wherein
each region corresponds to a different vertical stripe of the
image.

40. The computer system as recited in claim 38, wherein
the regions overlap.

41. The computer system as recited in claim 38, wherein
the regions vary in size over time.

42. The computer system as recited in claim 38, wherein
each region varies in size on a frame-by-frame basis to
balance the number of samples filtered by each of the
filtering means.

43. The computer system as recited in claim 38, wherein
each region corresponds to a different horizontal stripe of the
image.

44. The computer system as recited in claim 38, wherein
each region corresponds to a different rectangular portion of
the image.

45. The computer system as recited in claim 38, wherein
each region corresponds to a different quadrant of the image.

46. The method as recited in claim 38, wherein one or
more of the plurality of regions comprises portions of the
image corresponding to one or more odd scan lines of the
image, and wherein one or more of the plurality of regions
comprises portions of the image corresponding to one or
more even scan lines of the image.

47. The computer system as recited in claim 38, wherein
the size of each portion of the image varies on a frame-by-
frame basis to balance the number of samples filtered by
each of the filtering means.

48. The computer system as recited in claim 38, wherein
each sample comprises color components, and wherein the
filtering means are configured to determine which samples
are within a predetermined filter envelope, multiply the
samples within the predetermined filter envelope by one or
more un-normalized weighting factors, wherein the
weighting factors vary in relation to the sample's position relative
to the center of the filter envelope, and normalize the
resulting output pixels.

49. The computer system as recited in claim 38, wherein 
each sample comprises color components, and wherein the 
filtering means are configured to determine which samples 
are within a predetermined filter envelope, multiply the 
samples within the predetermined filter envelope by one or 
more normalized weighting factors to form one or more 
output pixels. 

50. The computer system as recited in claim 38, wherein 
each sample further comprises an alpha component. 

51. The computer system as recited in claim 38, wherein 
the samples stored in the sample buffer are double buffered. 

52. The computer system as recited in claim 38, wherein 
the samples stored in the sample buffer are stored in bins. 

* * * * *

ABSTRACT



An apparatus to determine a transform of a block of encoded
data, the block of encoded data comprising a plurality of data
elements. An input register is configured to receive a pre- 
determined quantity of data elements. At least one butterfly 
processor is coupled to the input register and is configured 
to perform at least one mathematical operation on selected 
pairs of data elements to produce an output of processed data 
elements. At least one intermediate register is coupled to the 
butterfly processor and configured to temporarily store the 
processed data. A feedback loop is coupled to the interme- 
diate register and the butterfly processor, and where if 
enabled, is configured to transfer a first portion of processed 
data elements to the appropriate butterfly processor to per- 
form additional mathematical operations and where if 
disabled, is configured to transfer a second portion of 
processed data elements to at least one holding register. 

91 Claims, 12 Drawing Sheets

APPARATUS AND METHOD FOR
ENCODING AND COMPUTING A DISCRETE
COSINE TRANSFORM USING A
BUTTERFLY PROCESSOR

This application claims the benefit of priority of the U.S. 
Provisional Patent Application Ser. No. 60/291,467, filed 
May 16, 2001, which is incorporated herein by reference in 
its entirety. 

BACKGROUND OF THE INVENTION 

I. Field of the Invention 

The present invention relates to digital signal processing.
More specifically, the present invention relates to an
apparatus and method for determining the transform of a block of
encoded data.

II. Description of the Related Art 

Digital picture processing has a prominent position in the
general discipline of digital signal processing. The
importance of human visual perception has encouraged
tremendous interest and advances in the art and science of digital
picture processing. In the field of transmission and reception
of video signals, such as those used for projecting films or
movies, various improvements are being made to image
compression techniques. Many of the current and proposed
video systems make use of digital encoding techniques.
Aspects of this field include image coding, image
restoration, and image feature selection. Image coding
represents the attempts to transmit pictures over digital
communication channels in an efficient manner, making use of as
few bits as possible to minimize the bandwidth required,
while at the same time, maintaining distortions within
certain limits. Image restoration represents efforts to recover the
true image of the object. The coded image being transmitted
over a communication channel may have been distorted by
various factors. Sources of degradation may have arisen
originally in creating the image from the object. Feature
selection refers to the selection of certain attributes of the
picture. Such attributes may be required in the recognition,
classification, and decision in a wider context.

Digital encoding of video, such as that in digital cinema,
is an area which benefits from improved image compression
techniques. Digital image compression may be generally
classified into two categories: loss-less and lossy methods. A
loss-less image is recovered without any loss of information.
A lossy method involves an irrecoverable loss of some
information, depending upon the compression ratio, the
quality of the compression algorithm, and the
implementation of the algorithm. Generally, lossy compression
approaches are considered to obtain the compression ratios
desired for a cost-effective digital cinema approach. To
achieve digital cinema quality levels, the compression
approach should provide a visually loss-less level of
performance. As such, although there is a mathematical loss of
information as a result of the compression process, the
image distortion caused by this loss should be imperceptible
to a viewer under normal viewing conditions.

Existing digital image compression technologies have
been developed for other applications, namely for television
systems. Such technologies have made design compromises
appropriate for the intended application, but do not meet the
quality requirements needed for cinema presentation.

Digital cinema compression technology should provide
the visual quality that a moviegoer has previously
experienced. Ideally, the visual quality of digital cinema should
attempt to exceed that of a high-quality release print film. At
the same time, the compression technique should have high
coding efficiency to be practical. As defined herein, coding
efficiency refers to the bit rate needed for the compressed
image quality to meet a certain qualitative level. Moreover,
the system and coding technique should have built-in
flexibility to accommodate different formats and should be cost
effective; that is, a small-sized and efficient decoder or
encoder process.

One compression technique capable of offering signifi- 
cant levels of compression while preserving the desired level 
of quality utilizes adaptively sized blocks and sub-blocks of 
encoded Discrete Cosine Transform (DCT) coefficient data. 
Although DCT techniques are gaining wide acceptance as a 
digital compression method, efficient hardware implemen- 
tation has been difficult. 

SUMMARY OF THE INVENTION 

The invention provides for efficient hardware implementation
of adaptive block sized DCT encoded data. An apparatus
determines a transform of a block of encoded data, the
block of encoded data comprising a plurality of data elements.
An input register is configured to receive a predetermined
quantity of data elements. At least one butterfly
processor is coupled to the input register and is configured 
to perform at least one mathematical operation on selected 
pairs of data elements to produce an output of processed data 
elements. At least one intermediate register is coupled to the 
butterfly processor and configured to temporarily store the 
processed data. A feedback loop is coupled to the interme- 
diate register and the butterfly processor, and where if 
enabled, is configured to transfer a first portion of processed 
data elements to the appropriate butterfly processor to per- 
form additional mathematical operations and where if 
disabled, is configured to transfer a second portion of 
processed data elements to at least one holding register. 
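
The recirculating data path summarized above can be pictured with a small software model. The following is a minimal sketch only: it computes an unnormalized four-point DCT-II in two butterfly passes, with a list standing in for the intermediate register and the returned list standing in for the holding register. The function names, the four-point size, and the scaling are assumptions made for illustration and are not taken from the apparatus described here.

```python
import math

def butterfly(a, b):
    """One add/subtract butterfly on a pair of data elements."""
    return a + b, a - b

def dct4_by_butterflies(x):
    """Unnormalized 4-point DCT-II computed in two butterfly passes."""
    c1 = math.cos(math.pi / 8)
    c3 = math.cos(3 * math.pi / 8)
    c4 = math.cos(math.pi / 4)
    # Pass 1: butterflies on the pairs (x0, x3) and (x1, x2).
    s0, d0 = butterfly(x[0], x[3])
    s1, d1 = butterfly(x[1], x[2])
    intermediate = [s0, s1, d0, d1]  # stands in for the intermediate register
    # Pass 2 (feedback enabled): the two sums go through the butterfly again.
    e0, e1 = butterfly(intermediate[0], intermediate[1])
    # The two differences are combined using fixed cosine multipliers.
    out1 = c1 * intermediate[2] + c3 * intermediate[3]
    out3 = c3 * intermediate[2] - c1 * intermediate[3]
    # Feedback disabled: results move on (stands in for the holding register).
    return [e0, out1, c4 * e1, out3]

def dct4_direct(x):
    """Reference: X[k] = sum over n of x[n] * cos((2n + 1) * pi * k / 8)."""
    return [sum(x[n] * math.cos((2 * n + 1) * math.pi * k / 8) for n in range(4))
            for k in range(4)]
```

Comparing `dct4_by_butterflies` with `dct4_direct` on the same input is a quick way to confirm that the two-pass arrangement reproduces the direct sum.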

Accordingly, it is an aspect of an embodiment to provide 
a processor that efficiently implements discrete cosine trans- 
form (DCT) and discrete quadtree transform (DQT) tech- 
niques. 

It is another aspect of an embodiment to provide a 
processor that efficiently implements inverse discrete cosine 
transform (IDCT) and inverse discrete quadtree transform 
(IDQT) techniques. 

It is another aspect of an embodiment to implement a 
processor that is flexible in that the same hardware compo- 
nents may be reconfigured to compute different mathemati- 
cal operations within the same transform trellis. 

It is another aspect of an embodiment to provide an image 
processor that maintains a high quality image while mini- 
mizing image distortion. 

It is another aspect of an embodiment to process portions 
of encoded data in parallel. 

It is another aspect of an embodiment to process read, 
write, and butterfly operations in a single clock cycle. 

It is another aspect of an embodiment to provide and 
implement a control sequencer having the variability to 
control different block sizes of data and maintain the speed 
necessary for real-time processing. 

It is another aspect of an embodiment to implement a 
processor such that the processor is configurable to operate 
on variable block sizes. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The aspects, features, objects, and advantages of the
invention will become more apparent from the detailed
description set forth below when taken in conjunction with
the drawings in which like reference characters identify
correspondingly throughout and wherein:
FIG. 1 is a block diagram of column and row processing
of a block of data;
FIG. 2a is a block diagram illustrating the flow of data
through an encoding process;
FIG. 2b is a flow diagram illustrating the flow of data
through a decoding process;
FIG. 2c is a block diagram illustrating the processing
steps involved in variance based block size assignment;
FIG. 3 is a block diagram illustrating an apparatus to
compute a transform, such as a discrete cosine transform
(DCT) and a discrete quantization transform (DQT),
embodying the invention;
FIG. 4 illustrates a DCT trellis that is implemented by the
apparatus of FIG. 3;
FIG. 5 illustrates an IDCT trellis that is implemented by
the apparatus of FIG. 3;
FIG. 6 illustrates a single butterfly processor with input
and output multiplexers;
FIG. 7 illustrates a block diagram of a write multiplexer;
FIG. 8 illustrates a block diagram of a butterfly processor;
FIG. 9a illustrates a No Operation configuration that may
be performed by butterfly processor of FIG. 8;
FIG. 9b illustrates an Accumulate Operation configuration
that may be performed by butterfly processor of FIG. 8;
FIG. 9c illustrates a butterfly DCT Operation configuration
that may be performed by butterfly processor of FIG. 8;
FIG. 9d illustrates a Butterfly IDCT Operation configuration
that may be performed by butterfly processor of FIG. 8;
FIG. 9e illustrates an Accumulate Register Operation
configuration that may be performed by butterfly processor
of FIG. 8;
FIG. 9f illustrates a DQT/IDQT Operation configuration
that may be performed by butterfly processor of FIG. 8;
FIG. 10 illustrates a flowchart showing the process of
calculating a transform, such as a discrete cosine transform
(DCT) and a discrete quantization transform (DQT),
embodying the invention;
FIG. 11a illustrates an exemplary block size assignment;
FIG. 11b illustrates the corresponding quad-tree decomposition
for the block size assignment of FIG. 11a; and
FIG. 11c illustrates a corresponding PQR data for the
block size assignment of FIG. 11a.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

In order to facilitate digital transmission of digital signals
and enjoy the corresponding benefits, it is generally necessary
to employ some form of signal compression. To achieve
high definition in a resulting image, it is also important that
the high quality of the image be maintained. Furthermore,
computational efficiency is desired for compact hardware
implementation, which is important in many applications.

Accordingly, spatial frequency-domain techniques, such
as Fourier transforms, wavelet, and discrete cosine transforms
(DCT) generally satisfy the above criteria. The DCT
has energy packing capabilities and approaches a statistically
optimal transform in decorrelating a signal. The development
of various algorithms for the efficient implementation
of DCT further contributes to its mainstream applicability.
The reduction in computational complexity of these algorithms
and their recursive structure result in a more simplified
hardware scheme. DCTs are generally orthogonal and separable.
The fact that DCTs are orthogonal implies that the
energy, or information, of a signal is preserved under
transformation; that is, mapping into the DCT domain. The fact
that DCTs are separable implies that a multidimensional
DCT may be implemented by a series of one-dimensional
transforms. Accordingly, faster algorithms may be developed
for one-dimensional DCTs and be directly extended to
multidimensional transforms.

In a DCT, a block of pixels is transformed into a same-size
block of coefficients in the frequency domain. Essentially,
the transform expresses a block of pixels as a linear
combination of orthogonal basis images. The magnitudes of the
coefficients express the extent to which the block of pixels
and the basis images are similar.

Generally, an image to be processed in the digital domain
is composed of pixel data divided into an array of
non-overlapping blocks, NxN in size. A two-dimensional DCT
may be performed on each block. The two-dimensional DCT
is defined by the following relationship:

\[
X(k,l) = \frac{\alpha(k)\,\beta(l)}{N} \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2m+1)\pi k}{2N}\right] \cos\left[\frac{(2n+1)\pi l}{2N}\right], \qquad 0 \le k, l \le N-1
\]

where

\[
\alpha(k), \beta(k) = \begin{cases} 1, & \text{if } k = 0 \\ \sqrt{2}, & \text{if } k \ne 0 \end{cases}
\]

and
x(m,n) is the pixel location (m,n) within an NxM block, and
X(k,l) is the corresponding DCT coefficient.

Since pixel values are non-negative, the DCT component
X(0,0) is always positive and usually has the most energy. In
fact, for typical images, most of the transform energy is
concentrated around the component X(0,0). This energy
compaction property makes the DCT technique such an
attractive compression method.

It has been observed that most natural images are made up
of flat relatively slow varying areas, and busy areas such as
object boundaries and high-contrast texture. Contrast adaptive
coding schemes take advantage of this factor by assigning
more bits to the busy areas and fewer bits to the less busy
areas. This technique is disclosed in U.S. Pat. No. 5,021,891,
entitled "Adaptive Block Size Image Compression Method
and System," assigned to the assignee of the present invention
and incorporated herein by reference. DCT techniques
are also disclosed in U.S. Pat. No. 5,107,345, entitled
"Adaptive Block Size Image Compression Method And
System," assigned to the assignee of the present invention
and incorporated herein by reference. Further, the use of the
ABSDCT technique in combination with a Differential
Quadtree Transform technique is discussed in U.S. Pat. No.
5,452,104, entitled "Adaptive Block Size Image Compression
Method And System," also assigned to the assignee of
the present invention and incorporated herein by reference.
The systems disclosed in these patents utilize what is
referred to as "intra-frame" encoding, where each frame of
image data is encoded without regard to the content of any
other frame. Using the ABSDCT technique, the achievable
data rate may be greatly reduced without discernible
degradation of the image quality.
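
As an illustration of the two-dimensional DCT and its energy properties, the sketch below evaluates the double sum directly for a small square block. It assumes the usual orthonormal scaling (a factor of 1 for index zero and the square root of 2 otherwise, divided by N); the function names are hypothetical, and a direct four-deep loop is used for clarity rather than speed.

```python
import math

def dct2(block):
    """Direct N x N two-dimensional DCT of a square list-of-lists block."""
    n = len(block)

    def scale(k):
        # Assumed orthonormal scaling: 1 for the zero index, sqrt(2) otherwise.
        return 1.0 if k == 0 else math.sqrt(2.0)

    out = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            acc = 0.0
            for m in range(n):
                for p in range(n):
                    acc += (block[m][p]
                            * math.cos((2 * m + 1) * math.pi * k / (2 * n))
                            * math.cos((2 * p + 1) * math.pi * l / (2 * n)))
            out[k][l] = scale(k) * scale(l) * acc / n
    return out

def energy(block):
    """Sum of squared entries, used to check energy preservation."""
    return sum(v * v for row in block for v in row)
```

For a perfectly flat block, every coefficient except X(0,0) comes out as zero, which is the extreme case of the energy compaction described above; for any block, `energy(dct2(b))` matches `energy(b)` up to rounding, which reflects orthogonality.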

Using ABSDCT, a video signal will generally be seg- 
mented into frames and blocks of pixels for processing. The 
DCT operator is one method of converting a time-sampled 
signal to a frequency representation of the same signal. By 
converting to a frequency representation, DCT techniques 
have been shown to allow for very high levels of 
compression, as quantizers can be designed to take advan- 
tage of the frequency distribution characteristics of an 
image. In a preferred embodiment, one 16x16 DCT is 
applied to a first ordering, four 8x8 DCTs are applied to a 
second ordering, 16 4x4 DCTs are applied to a third 
ordering, and 64 2x2 DCTs are applied to a fourth ordering. 

For image processing purposes, the DCT operation is 
performed on pixel data that is divided into an array of 
non-overlapping blocks. Note that although block sizes are 
discussed herein as being NxN in size, it is envisioned that 
various block sizes may be used. For example, an NxM 
block size may be utilized where both N and M are integers 
with M being either greater than or less than N. Another 
important aspect is that the block is divisible into at least one 
level of sub-blocks, such as N/ixN/i, N/ixN/j, N/ixM/j, etc.,
where i and j are integers. Furthermore, the exemplary
block size as discussed herein is a 16x16 pixel block with 
corresponding block and sub-blocks of DCT coefficients. It 
is further envisioned that various other integers such as both 
even or odd integer values may be used, e.g., 9x9. 

A color signal may be converted from RGB space to
YC1C2 space, with Y being the luminance, or brightness,
component, and C1 and C2 being the chrominance, or color,
components. Because of the low spatial sensitivity of the eye
to color, many systems sub-sample the C1 and C2 components
by a factor of four in the horizontal and vertical
directions. However, the sub-sampling is not necessary. A
full resolution image, known as 4:4:4 format, may be either
very useful or necessary in some applications such as those
referred to as covering digital cinema. Two possible YC1C2
representations are the YIQ representation and the YUV
representation, both of which are well known in the art. It is
also possible to employ a variation of the YUV representation
known as YCbCr.
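
A color conversion and sub-sampling step of this kind might be sketched as follows. The text does not specify a conversion matrix, so the full-range ITU-R BT.601 coefficients used by JFIF are assumed here purely for illustration; the function names are likewise hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (assumed full-range BT.601)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample(plane, factor):
    """Keep every `factor`-th sample horizontally and vertically."""
    return [row[::factor] for row in plane[::factor]]
```

In a 4:4:4 pipeline the `subsample` step is simply skipped, so all three components keep the full resolution of the source image.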

FIGS. 1a and 1b illustrate column and row processing of
an NxN block of encoded data 100 and 120. An N dimensional
transform may be performed as a cascade of N
one-dimensional transforms. For example, a 2x2 DCT is
performed as a cascade of two one-dimensional DCT
processes, first operating on each column and then operating
on each row. A first column m (104) is processed, followed
by column m+1 (108), followed by column m+2 (112), and
so on through column n (116). After the columns are
processed, the rows 120 are processed as illustrated in FIG.
1b. First, row m (124) is processed, followed by row m+1
(128), row m+2 (132) and so on through row n (136).
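
The column-then-row cascade can be sketched in a few lines. This is an illustrative model only, assuming an orthonormal one-dimensional DCT-II; the helper names are hypothetical. It also shows that processing rows first and columns second gives the same result.

```python
import math

def dct1(v):
    """Orthonormal one-dimensional DCT-II of a list of samples."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[m] * math.cos((2 * m + 1) * math.pi * k / (2 * n))
                for m in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out

def transpose(block):
    return [list(col) for col in zip(*block)]

def dct2_columns_then_rows(block):
    # Pass 1: transform every column (the rows of the transposed block).
    cols_done = transpose([dct1(col) for col in transpose(block)])
    # Pass 2: transform every row of the column-processed block.
    return [dct1(row) for row in cols_done]

def dct2_rows_then_columns(block):
    rows_done = [dct1(row) for row in block]
    return transpose([dct1(col) for col in transpose(rows_done)])
```

The transpose between the two passes plays the same role in this model that writing column results back to memory and reading them out as rows plays in a hardware data path.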

Similarly, another example may be an 8x8 block of data 
needing IDCT processing. The 8x8 block may be broken 
into four two-dimensional IDCTs. Each two-dimensional 
IDCT may then be processed in the same manner as
the two-dimensional DCT described with respect
to FIGS. 1a and 1b.

FIG. 2a illustrates a block diagram 250 of the flow of 
encoded data during an encoding process. In the encoding 
process, encoded data is transformed from the pixel domain 
to the frequency domain. FIG. 2b illustrates a block diagram 
254 of the flow of encoded data through a decoding process. 
In the decoding process, encoded data is transformed from 
the frequency domain to the pixel domain. As illustrated in
the encode process 250, a block size assignment (BSA) of
the encoded data is first performed (258). In an aspect of an
embodiment, each of the Y, Cb, and Cr components is 
processed without sub-sampling. Thus, an input of a 16x16 
block of pixels is provided to the block size assignment 
element 258, which performs block size assignment in 
preparation for video compression. 

The block size assignment element 258 determines the 
block decomposition of a block based on the perceptual 
characteristics of the image in the block. Block size assign- 
ment subdivides each 16x16 block into smaller blocks in a 
quad-tree fashion depending on the activity within a 16x16 
block. The block size assignment element 258 generates
quad-tree data, called the PQR data, whose length can be
between 1 and 21 bits. Thus, if block size assignment 
determines that a 16x16 block is to be divided, the R bit of 
the PQR data is set and is followed by four additional bits 
of Q data corresponding to the four divided 8x8 blocks. If 
block size assignment determines that any of the 8x8 blocks 
is to be subdivided, then four additional bits of P data for 
each 8x8 block subdivided are added. 
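
The range of 1 to 21 bits follows from one R bit, up to four Q bits, and up to sixteen P bits. The sketch below assembles such a bit list; the ordering (R first, then all Q bits, then the P groups in block order) is an assumption made for illustration, as is the function name.

```python
def pqr_bits(r, q=None, p=None):
    """Assemble PQR bits for one 16x16 block.

    r: R bit (1 if the 16x16 block is subdivided).
    q: four Q bits, one per 8x8 block (needed only when r is 1).
    p: mapping or list giving four P bits for each subdivided 8x8 block.
    """
    bits = [r]
    if r:
        bits.extend(q)
        for i, q_bit in enumerate(q):
            if q_bit:
                bits.extend(p[i])
    return bits
```

An undivided block yields a single bit, a block divided only into 8x8 blocks yields 5 bits, and a block whose four 8x8 blocks are all subdivided yields the maximum of 21 bits.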

Data is divided into block sizes, such as 2x2, 4x4, 8x8, 
and 16x16. An encode data processor then performs a 
transform (DCT/DQT) of the encoded data (262), as is 
described with respect to FIG. 3. After the DCT/DQT 
process 262 is completed, a quantization process (QB) 266 
is performed on the encoded data. This completes transfor- 
mation of encoded data from the pixel domain to the 
frequency domain. 

In an embodiment, the DCT coefficients are quantized 
using frequency weighting masks (FWMs) and a quantiza- 
tion scale factor. A FWM is a table of frequency weights of 
the same dimensions as the block of input DCT coefficients. 
The frequency weights apply different weights to the dif- 
ferent DCT coefficients. The weights are designed to empha- 
size the input samples having frequency content that the 
human visual system is more sensitive to, and to 
de-emphasize samples having frequency content that the 
visual system is less sensitive to. The weights may also be 
designed based on factors such as viewing distances, etc. 

Huffman codes are designed from either the measured or 
theoretical statistics of an image. It has been observed that 
most natural images are made up of blank or relatively 
slowly varying areas, and busy areas such as object bound- 
aries and high-contrast texture. Huffman coders with 
frequency-domain transforms such as the DCT exploit these 
features by assigning more bits to the busy areas and fewer 
bits to the blank areas. In general, Huffman coders make use 
of look-up tables to code the run-length and the non-zero 
values. 

The weights are selected based on empirical data. A 
method for designing the weighting masks for 8x8 DCT 
coefficients is disclosed in ISO/IEC JTC1 CD 10918, "Digi- 
tal compression and encoding of continuous-tone still 
images — part 1: Requirements and guidelines," Interna- 
tional Standards Organization, 1994, which is herein incor- 
porated by reference. In general, two FWMs are designed, 
one for the luminance component and one for the chromi- 
nance components. The FWM tables for block sizes 2x2, 
4x4 are obtained by decimation and 16x16 by interpolation 
of that for the 8x8 block. The scale factor controls the 
quality and bit rate of the quantized coefficients. 

Thus, each DCT coefficient is quantized according to the
relationship:

where DCT(i,j) is the input DCT coefficient, fwm(i,j) is the
frequency weighting mask, q is the scale factor, and DCTq 
(i,j) is the quantized coefficient. Note that depending on the 
sign of the DCT coefficient, the first term inside the braces 
is rounded up or down. The DQT coefficients are also 
quantized using a suitable weighting mask. However, mul- 
tiple tables or masks can be used, and applied to each of the 
Y, Cb, and Cr components. 

The quantized coefficients are provided to a zigzag scan 
serializer 268. The serializer 268 scans the blocks of quan- 
tized coefficients in a zigzag fashion to produce a serialized 
stream of quantized coefficients. A number of different 
zigzag scanning patterns, as well as patterns other than 
zigzag may also be chosen. A preferred technique employs 
8x8 block sizes for the zigzag scanning, although other 
sizes, such as 4x4 or 16x16, may be employed. 
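
One possible zigzag ordering can be generated as shown below. The sketch assumes the conventional pattern that walks anti-diagonals in alternating directions starting from the top left corner; as noted above, other patterns may be chosen, and the function names are hypothetical.

```python
def zigzag_order(n):
    """Return (row, col) index pairs of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col identifies one anti-diagonal
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom left to top right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def zigzag_scan(block):
    """Serialize a square block of quantized coefficients."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Because low-frequency coefficients sit near the top left corner, this ordering tends to place the non-zero values first and group the zeros into long runs toward the end of the stream.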

Note that the zigzag scan serializer 268 may be placed 
either before or after the quantizer 266. The net results are 
the same. 

In any case, the stream of quantized coefficients is pro- 
vided to a variable length coder 269. The variable length 
coder 269 may make use of run-length encoding of zeros 
followed by encoding. This technique is discussed in detail 
in aforementioned U.S. Pat. Nos. 5,021,891, 5,107,345 and 
5,452,104, and in pending U.S. patent application Ser. No. 
<000163>, which is incorporated by reference and is sum- 
marized herein. A run-length coder takes the quantized 
coefficients and notes the run of successive coefficients from 
the non-successive coefficients. The successive values are 
referred to as run-length values, and are encoded. The 
non-successive values are separately encoded. In an 
embodiment, the successive coefficients are zero values, and 
the non-successive coefficients are non-zero values. 
Typically, the run length is from 0 to 63 bits, and the size is 
an AC value from 1-10. An end of file code adds an 
additional code — thus, there is a total of 641 possible codes. 
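
The run-length step and the code count can be illustrated with a short sketch. It assumes that each non-zero coefficient is paired with the number of zeros preceding it and that trailing zeros are absorbed by a single end marker; these details, and the names used, are assumptions for illustration rather than the coder defined in the referenced patents.

```python
def run_length_pairs(coeffs):
    """Return (zero_run, value) pairs for the non-zero values, then an end marker."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    pairs.append("END")  # trailing zeros are represented by the end marker alone
    return pairs

def size_category(value):
    """Number of bits needed to represent the magnitude of a non-zero value."""
    return abs(value).bit_length()

RUN_LENGTHS = 64  # run lengths 0 through 63
SIZES = 10        # size values 1 through 10
TOTAL_CODES = RUN_LENGTHS * SIZES + 1  # every (run, size) pair plus one end code
```

With 64 run lengths and 10 sizes there are 640 combinations, and the end code brings the total to 641.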

In the decoding process, encoded data in the frequency 
domain is converted back into the pixel domain. A variable 
length decoder 270 produces a run-length and size of the 
data and provides the data to an inverse zigzag scan serial- 
izer 271 that orders the coefficients according to the scan 
scheme employed. The inverse zigzag scan serializer 271 
receives the PQR data to assist in proper ordering of the 
coefficients into a composite coefficient block. The compos- 
ite block is provided to an inverse quantizer 272, for undoing 
the processing due to the use of the frequency weighting 
masks. 

A finger printer (H20) 273 is then performed on the 
encoded data. The finger printer places a watermark or other 
identifier information on the data. The watermark may be 
recovered at a later time, to reveal identifier information. 
Identifier information may include information such as 
where and when material was played, and who was autho- 
rized to play such material. Following the finger printer 273, 
a decoder data process 274 (IDQT/IDCT) is commenced, 
which is described in detail with respect to FIG. 4. After the 
data is decoded, the data is sent to the Frame Buffer Interface 
(FBI) 278. The FBI is configured to read and write uncom- 
pressed data a frame at a time. In an embodiment, the FBI 
has a capacity of four frames, although it is contemplated 
that the storage capacity may be varied. 

Referring now to FIG. 2c, a flow diagram showing details
of the operation of the block size assignment element 258 is
provided. The algorithm uses the variance of a block as a
metric in the decision to subdivide a block. Beginning at step
202, a 16x16 block of pixels is read. At step 204, the
variance, v16, of the 16x16 block is computed. The variance
is computed as follows:

\[
\mathrm{var} = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{i,j}^2 - \left( \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x_{i,j} \right)^2
\]

where N=16, and x_{i,j} is the pixel in the i-th row, j-th column
within the NxN block. At step 206, first the variance
threshold T16 is modified to provide a new threshold T'16 if
the mean value of the block is between two predetermined
values, then the block variance is compared against the new
threshold, T'16.
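
The block variance used as the subdivision metric can be computed as in the sketch below, which assumes the usual definition (mean of the squared pixel values minus the square of the mean). The function names are hypothetical.

```python
def block_mean(block):
    """Mean pixel value of a square block given as a list of rows."""
    n = len(block)
    return sum(v for row in block for v in row) / (n * n)

def block_variance(block):
    """Variance of a square block: mean of squares minus square of the mean."""
    n = len(block)
    mean_of_squares = sum(v * v for row in block for v in row) / (n * n)
    return mean_of_squares - block_mean(block) ** 2
```

A flat block has zero variance and is never subdivided, while a block containing an edge or busy texture produces a large variance.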

If the variance v16 is not greater than the threshold T16,
then at step 208, the starting address of the 16x16 block is
written, and the R bit of the PQR data is set to 0 to indicate
that the 16x16 block is not subdivided. The algorithm then
reads the next 16x16 block of pixels. If the variance v16 is
greater than the threshold T16, then at step 210, the R bit of
the PQR data is set to 1 to indicate that the 16x16 block is
to be subdivided into four 8x8 blocks.

The four 8x8 blocks, i=1:4, are considered sequentially
for further subdivision, as shown in step 212. For each 8x8
block, the variance, v8_i, is computed, at step 214. At step
216, first the variance threshold T8 is modified to provide a
new threshold T'8 if the mean value of the block is between
two predetermined values, then the block variance is compared
to this new threshold.

If the variance v8_i is not greater than the threshold T8,
then at step 218, the starting address of the 8x8 block is
written, and the corresponding Q bit, Q_i, is set to 0. The next
8x8 block is then processed. If the variance v8_i is greater
than the threshold T8, then at step 220, the corresponding Q
bit, Q_i, is set to 1 to indicate that the 8x8 block is to be
subdivided into four 4x4 blocks.

The four 4x4 blocks, j=1:4, are considered sequentially
for further subdivision, as shown in step 222. For each 4x4
block, the variance, v4_ij, is computed, at step 224. At step
226, first the variance threshold T4 is modified to provide a
new threshold T'4 if the mean value of the block is between
two predetermined values, then the block variance is compared
to this new threshold.

If the variance v4_ij is not greater than the threshold T4,
then at step 228, the address of the 4x4 block is written, and
the corresponding P bit, P_ij, is set to 0. The next 4x4 block
is then processed. If the variance v4_ij is greater than the
threshold T4, then at step 230, the corresponding P bit, P_ij,
is set to 1 to indicate that the 4x4 block is to be subdivided
into four 2x2 blocks. In addition, the addresses of the four 2x2
blocks are written.

The thresholds T16, T8, and T4 may be predetermined
constants. This is known as the hard decision. Alternatively,
an adaptive or soft decision may be implemented. The soft
decision varies the thresholds for the variances depending on
the mean pixel value of the 2Nx2N blocks, where N can be
8, 4, or 2. Thus, functions of the mean pixel values may be
used as the thresholds.

For purposes of illustration, consider the following
example. Let the predetermined variance thresholds for the
Y component be 50, 1100, and 880 for the 16x16, 8x8, and
4x4 blocks, respectively. In other words, T16=50, T8=1100,
and T4=880. Let the range of mean values be 80 and 100.
Suppose the computed variance for the 16x16 block is 60.
Since 60 and its mean value 90 are greater than T16, the
16x16 block is subdivided into four 8x8 sub-blocks. Suppose
the computed variances for the 8x8 blocks are 1180,
935, 980, and 1210. Since two of the 8x8 blocks have 
variances that exceed T8, these two blocks are further 
subdivided to produce a total of eight 4x4 sub-blocks. 
Finally, suppose the variances of the eight 4x4 blocks are 
620, 630, 670, 610, 590, 525, 930, and 690, with the first 
four corresponding means values 90, 120, 110, 115. Since 
the mean value of the first 4x4 block falls in the range (80, 
100), its threshold will be lowered to T'4=200 which is less 
than 880. So, this 4x4 block will be subdivided as well as the 
seventh 4x4 block. The resulting block size assignment is 
illustrated in FIG. 11a. The corresponding quad-tree decomposition
is illustrated in FIG. 11b. The PQR data generated
by this block size assignment is illustrated in FIG. 11c. 
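
The decisions in this example can be reproduced with a short sketch. The thresholds, variances, mean range, and lowered 4x4 threshold of 200 are taken from the example; the function name is hypothetical, and the means of the last four 4x4 blocks, which the example does not state, are represented as `None` and treated as falling outside the range.

```python
def should_subdivide(variance, threshold, mean=None,
                     mean_range=(80, 100), lowered=None):
    """Compare a block variance with its threshold.

    If a lowered threshold is supplied and the block mean lies inside
    mean_range, the lowered threshold is used instead (soft decision).
    """
    if lowered is not None and mean is not None:
        if mean_range[0] < mean < mean_range[1]:
            threshold = lowered
    return variance > threshold

T8, T4, T4_LOWERED = 1100, 880, 200

# 8x8 level: variances of the four 8x8 blocks.
var8 = [1180, 935, 980, 1210]
split8 = [should_subdivide(v, T8) for v in var8]

# 4x4 level: variances of the eight 4x4 blocks and the four stated means.
var4 = [620, 630, 670, 610, 590, 525, 930, 690]
mean4 = [90, 120, 110, 115, None, None, None, None]
split4 = [should_subdivide(v, T4, m, lowered=T4_LOWERED)
          for v, m in zip(var4, mean4)]
```

Here `split8` marks the first and fourth 8x8 blocks for subdivision, and `split4` marks the first 4x4 block (through the lowered threshold) and the seventh (through the fixed threshold), in agreement with the example.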

Note that a similar procedure is used to assign block sizes 
for the color components C1 and C2. The color components
may be decimated horizontally, vertically, or both. 
Additionally, note that although block size assignment has 
been described as a top down approach, in which the largest 
block (16x16 in the present example) is evaluated first, a 
bottom up approach may instead be used. The bottom up 
approach will evaluate the smallest blocks (2x2 in the 
present example) first. 

The PQR data, along with the addresses of the selected 
blocks, are provided to a DCT/DQT element 262. The 
DCT/DQT element 262 uses the PQR data to perform 
discrete cosine transforms of the appropriate sizes on the 
selected blocks. Only the selected blocks need to undergo 
DCT processing. The DQT is also used for reducing the 
redundancy among the DC coefficients of the DCTs. A DC 
coefficient is encountered at the top left corner of each DCT
block. The DC coefficients are, in general, large compared to 
the AC coefficients. The discrepancy in sizes makes it 
difficult to design an efficient variable length coder. 
Accordingly, it is advantageous to reduce the redundancy 
among the DC coefficients. The DQT element performs 2-D 
DCTs on the DC coefficients, taken 2x2 at a time. Starting 
with 2x2 blocks within 4x4 blocks, a 2-D DCT is performed 
on the four DC coefficients. This 2x2 DCT is called the 
differential quad-tree transform, or DQT, of the four DC 
coefficients. Next, the DC coefficient of the DQT along with 
the three neighboring DC coefficients within an 8x8 block are
used to compute the next level DQT. Finally, the DC 
coefficients of the four 8x8 blocks within a 16x16 block are 
used to compute the DQT. Thus, in a 16x16 block, there is 
one true DC coefficient and the rest are AC coefficients 
corresponding to the DCT and DQT. 
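
A single DQT step, that is, a 2x2 DCT applied to four DC coefficients, reduces to sums and differences. The sketch below assumes orthonormal scaling (an overall factor of one half); the function name and the sample values are hypothetical.

```python
def dqt_2x2(dc):
    """2x2 DCT of four DC coefficients laid out as [[a, b], [c, d]]."""
    (a, b), (c, d) = dc
    return [[(a + b + c + d) / 2, (a - b + c - d) / 2],
            [(a + b - c - d) / 2, (a - b - c + d) / 2]]

# Four neighboring blocks with similar DC values: the result has one
# large term (the new DC coefficient) and three small difference terms.
level_out = dqt_2x2([[100, 102], [98, 100]])
```

The top left entry of the result acts as the DC coefficient passed to the next level, while the other three entries are small when neighboring blocks have similar averages. With this scaling, applying `dqt_2x2` twice returns the original four values, so the same routine can serve as its own inverse in this model.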

Within a frame, each 16x16 block is computed indepen- 
dently. Accordingly, the processing algorithm used for a 
given block may be changed as necessary, as determined by 
the PQR. 

FIG. 3 is a block diagram illustrating computation of the 
DCT/DQT and the IDQT/IDCT of a block of encoded data 
300. In encode mode, as illustrated in FIG. 3, the encoded 
data is initially in the pixel domain. As the encoded data is 
processed through intermediate steps, the encoded data is 
transformed into the frequency domain. In decode mode, the 
encoded data is initially in the frequency domain. As the 
encoded data is processed through intermediate steps, the 
encoded data is transformed into the pixel domain. 

Referring to FIG. 3, at least one MxN block of encoded 
data is stored in a transpose RAM 304. The transpose RAM 
304 may contain one or more blocks of MxN data. In an 
embodiment with two blocks of encoded data, one is con- 
figured to contain a current MxN block of data 308, and the 
other configure to contain a next block of MxN data 312. 



The blocks of data 308 and 312 are transferred to transpose 
RAM 304 from the block size assignment 208 as illustrated 
in FIG. 2a (in encode mode) or the fingerprinter 220 as 
illustrated in FIG. 2b (in decode mode). In an embodiment, 
the transpose RAM 304 may be a dual port RAM, such that 
a transpose RAM interface 316 processes the current block 
of data 308 and receives the next block of data from the 
fingerprinter 220. The transpose RAM interface 316 controls 
timing and may have buffered memory to allow blocks of 
data to be read from and written to the transpose RAM 304. 
In an embodiment, the transpose RAM 304 and transpose 
RAM interface 316 may be responsive to one or more 
control signals from a control sequencer 324. 

Encoded data enters a data processor 328 from transpose 
RAM 304 (or through the transpose RAM interface 316) 
into one or more input registers 332. In an embodiment, 
there are 16 input registers 332. In an embodiment, the data 
processor 328 first processes column data, followed by row 
data, as illustrated in FIG. 1. The data processor 328 may 
alternatively process the rows followed by the columns;
however, the following description assumes that column
data is processed prior to row data. The input register 332
comprises a single column of encoded data of the 16x16
block. The data processor 328 computes the transform by
performing mathematical operations on the encoded data, 
column by column, and writes the data back into the 
transpose RAM 304. After the columns of data are 
processed, the data processor 328 processes each row of 
encoded data. After each row of encoded data is processed, 
the data processor 328 outputs the data through an output 
register 352. 

In an embodiment, the block of data is a 16x16 block of 
encoded data, although it is contemplated that any size block 
of data may be used, such as 32x32, 8x8, 4x4, or 2x2, or 
combinations thereof. Accordingly, as the data processor 
328 is processing a block of data from the transpose RAM 
304 (for example, the current MxN block of data 308), the 
transpose RAM interface 316 receives the next block of data 
312 from the BSA 208 (encode mode) or the fingerprinter 
220 (decode mode). When the data processor 328 has 
completed processing of the current block of data 308, the 
transpose RAM interface 316 reads the next block of data 
312 from the transpose RAM 304 interface and loads it into 
data processor 328. As such, data from the transpose RAM 
304 toggles between the current block of data 308 and the 
next block of data 312 as dictated by the transpose RAM 
interface 316 and the control sequencer 324. 

The data processor 328 comprises input register 332, at 
least one butterfly processor within a monarch butterfly 
cluster 336 and at least one intermediate data register 340. 
Data processor 328 may also comprise a holding register 
344, a write multiplexer 348, and output data register 352.
Monarch butterfly cluster 336 may further comprise a first 
input multiplexer 356, and intermediate data register 340 
further comprises a second input multiplexer 360. The 
aforementioned components of data processor 328 are pref- 
erably controlled by the control sequencer 324. 

In operation, for a given column or row of data, the input 
register 332 is configured to receive the encoded data 
through the transpose RAM interface 316 from the transpose 
RAM 304. The control sequencer 324 enables certain 
addresses of the input register to send the data through input 
multiplexer 356. The input data is resequenced by selection
through input multiplexer 356 such that the proper pairs
of encoded data are selected for mathematical operations.
Controlled by the control sequencer 324, the input multi- 
plexer 356 passes the data to the monarch butterfly cluster 
336. The monarch butterfly cluster 336 comprises one or
more butterfly processors. In an embodiment, the monarch 
butterfly cluster 336 comprises four individual butterfly 
processors 364, 368, 372, and 376, and control sequencer 
324 routes encoded data through input multiplexer 356 to 
the appropriate butterfly processor. 

Each individual butterfly processor 364, 368, 372 or 376 
is capable of performing one-dimensional transforms, such
as the DCT, IDCT, DQT and IDQT. A one-dimensional
transform typically involves arithmetic operations carried out
by simple adders, subtractors, or a multiplier. After a portion of
a one-dimensional transform is performed on a pair of data
elements, the resulting output is transferred to the interme- 
diate data register 340. Intermediate data register 340 may 
be responsive to the control sequencer 324. The control 
sequencer may be a device such as a state machine, a 
microcontroller, or a programmable processor. In an 
embodiment in which the intermediate data register 340 is 
responsive to the control sequencer 324, selected data ele- 
ments stored in the intermediate data register 340 are fed 
back to the appropriate butterfly processor using a feedback path
380 and through first input multiplexer 356, to be processed
again (i.e., another portion of a one-dimensional transform).
This feedback loop continues until all one-dimensional
processing for the encoded data is completed. When the 
processing of the data is completed, the data from the 
intermediate data register 340 is written to the WRBR 
holding register 344. If the data being processed is column 
data, the data is written from the WRBR holding register 344 
through the write multiplexer 348 and stored back into the 
transpose RAM 304, so that row processing may begin. The 
write multiplexer 348 is controlled to resequence the pro- 
cessed column data back into its original sequence. If the 
holding register data is row data (and thus, all of the column 
processing is complete), the data is routed to the output 
register 352. The control sequencer 324 may then control 
output of data from the daisy chain multiplexer and output 
data register 352. 
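
The column-then-row flow described above is the usual separable way of building a two-dimensional transform from one-dimensional passes. The sketch below illustrates only that ordering. It assumes a direct, unnormalised DCT-II as a stand-in for the one-dimensional transform, and the function names `dct_1d`, `transpose` and `transform_2d` are illustrative rather than taken from the specification.

```python
import math

def dct_1d(x):
    """Direct (unnormalised) DCT-II, used only as a stand-in 1-D transform."""
    n = len(x)
    return [sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                for i in range(n))
            for k in range(n)]

def transpose(block):
    return [list(col) for col in zip(*block)]

def transform_2d(block, transform_1d=dct_1d):
    # Column pass: transform each column, then store the result back
    # (the role played by the transpose RAM in the text).
    cols = [transform_1d(col) for col in transpose(block)]
    stored = transpose(cols)
    # Row pass: transform each row of the stored intermediate block.
    return [transform_1d(row) for row in stored]

block = [[1.0] * 4 for _ in range(4)]   # constant 4x4 test block
out = transform_2d(block)               # only the DC term should be non-zero
```

For a constant block, all of the energy lands in the DC position, which gives a quick sanity check of the two passes.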

FIG. 4 illustrates a DCT trellis that may be implemented 
in encode mode by the data path processor 328 of FIG. 3. 
Similarly, FIG. 5 illustrates an IDCT trellis that may be 
implemented in decode mode by the data path processor 328 
of FIG. 3. As dictated by the PQR data and/or depending on 
the type of computation being performed, the control 
sequencer 324 may select different pairs of elements of
encoded data to combine and perform portions of a
one-dimensional transform. For example, in the trellis of FIG. 4,
eight operations occur in column 404. The operations illustrated
are as follows: x(0)+x(7), x(1)+x(6), x(3)+x(4), x(2)+x(5),
x(0)-x(7), x(1)-x(6), x(3)-x(4) and x(2)-x(5). Each of
the butterfly processors 364, 368, 372 and 376 (as shown in
FIG. 3) handles one of the four pairs of operations in a given
clock cycle. Thus, for example, butterfly processor 364 computes
the operation of x(0)+x(7) and x(0)-x(7), butterfly processor
368 computes the operation of x(1)+x(6) and x(1)-x(6),
butterfly processor 372 computes the operation of x(3)+x(4)
and x(3)-x(4), and butterfly processor 376 computes the
operation of x(2)+x(5) and x(2)-x(5), all in the same clock
cycle. The results of each of these operations may be
temporarily stored in a pipeline register or in the intermediate
data register 340, and then routed to the input multiplexer
360. Operation of the pipeline register is described
with respect to FIGS. 9c and 9d.

Optionally, in the next clock cycle, the remaining four 
multiplication operations are computed using the same four 
butterfly processors. Accordingly, butterfly processor 364 
computes [x(0)-x(7)]*($1/2C_{16}^{1}$), butterfly processor 368
computes [x(1)-x(6)]*($1/2C_{16}^{3}$), butterfly processor 372
computes [x(3)-x(4)]*($1/2C_{16}^{7}$) and butterfly processor 376
computes [x(2)-x(5)]*($1/2C_{16}^{5}$). The results of these
computations are temporarily stored in the intermediate data
register 340. As computations are completed, the encoded data
is not in the same sequence that the encoded data was in
when originally input. Accordingly, control sequencer 324
and input multiplexer 356 resequence encoded data, or
partially processed encoded data, after each feedback loop,
as necessary.
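
The two clock cycles described for column 404 can be sketched as a sum pass followed by a scaled-difference pass. This is a sketch under a stated assumption: the coefficient is read here as 1/(2·cos(kπ/16)), in the manner of the published form of B. G. Lee's algorithm, although the printed notation does not by itself settle that reading. The function name `first_stage` is hypothetical, and the assignment of pairs to individual butterfly processors is not modelled.

```python
import math

def first_stage(x):
    """One trellis column for an 8-point input: sums, then scaled differences."""
    n = len(x)                      # 8 in the example of the text
    half = n // 2
    # First clock cycle: x(i) + x(n-1-i) and x(i) - x(n-1-i)
    sums = [x[i] + x[n - 1 - i] for i in range(half)]
    diffs = [x[i] - x[n - 1 - i] for i in range(half)]
    # Second clock cycle: scale each difference.
    # Assumed reading of the coefficient: 1 / (2 * cos((2i + 1) * pi / (2n)))
    scaled = [d / (2.0 * math.cos((2 * i + 1) * math.pi / (2 * n)))
              for i, d in enumerate(diffs)]
    return sums, scaled

sums, scaled = first_stage([1, 2, 3, 4, 5, 6, 7, 8])
```

With this indexing, pair i uses the odd coefficient index 2i+1, so x(0)-x(7) is paired with index 1, x(1)-x(6) with 3, x(2)-x(5) with 5 and x(3)-x(4) with 7, which matches the pairing listed in the text.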

In the following clock cycle, computations are processed 
for column 408, the results of which are again stored in the 
intermediate data register 340 and fed back through input
multiplexer 360. Again, the fed back encoded data, now
partially processed, is resequenced such that the right portions
of encoded data are routed to the appropriate butterfly
processor. Accordingly, butterfly processor 364 processes
b(0)+b(2) and b(0)-b(2). Similarly, butterfly processor 368
computes b(1)+b(3) and b(1)-b(3), butterfly processor 372
computes b(4)+b(6) and b(4)-b(6), and butterfly processor
376 computes b(5)+b(7) and b(5)-b(7). The resulting computations
are again stored in the intermediate data register
340 or a pipeline register, and routed through the input
multiplexer 360. In the next clock cycle, multiplications by
$1/2C_{8}^{1}$, $1/2C_{8}^{3}$, $1/2C_{8}^{1}$, and $1/2C_{8}^{3}$ are performed in the same
manner as described with respect to column 404. Thus,
butterfly processor 364 computes [b(0)-b(2)]*($1/2C_{8}^{1}$), butterfly
processor 368 computes [b(1)-b(3)]*($1/2C_{8}^{3}$), butterfly processor
372 computes [b(4)-b(6)]*($1/2C_{8}^{1}$), and butterfly processor 376
computes [b(5)-b(7)]*($1/2C_{8}^{3}$).

In the next clock cycle, computations are processed for
column 412, in which values in the d(0) through d(7) positions
are computed, the results of which are again stored in the
intermediate data register 340 and fed back into input
multiplexer 360. Accordingly, each butterfly processor computes
one stage for its pair of inputs, such that butterfly processor
364 computes the operation of d(0)+d(1) and d(0)-d(1),
butterfly processor 368 computes the operation of d(2)+d(3)
and d(2)-d(3), butterfly processor 372 computes the operation
of d(4)+d(5) and d(4)-d(5), and butterfly processor 376
computes the operation of d(6)+d(7) and d(6)-d(7), all in the
same clock cycle. In the following clock cycle, multiplications
by $1/2C_{4}^{1}$ are computed in the same manner as
described with respect to columns 404 and 408.

Column 416 illustrates the next set of mathematical 
operations computed by the butterfly processors in the next 
clock cycle. As shown in the example of FIG. 4 in column 
416, only two operations are needed during this clock cycle: 
namely, the sum of the f(2) and f(3) components, and the 
sum of the f(6) and f(7) components. Accordingly, butterfly 
processor 364 computes f(2)+f(3), and butterfly processor 
368 computes f(6)+f(7). 

In the following clock cycle, the computations expressed 
in column 420 are processed. As such, values for h(4), h(5) 
and h(6) are computed. Accordingly, butterfly processor 364 
computes h(4)+h(6), butterfly processor 368 computes h(5)+ 
h(8), and butterfly processor 372 computes h(5)+h(6). 

As readily observable, FIG. 5 illustrates an IDCT trellis
that operates in a similar manner to, but in the opposite sequence
from, the trellis described with respect to FIG. 4. The IDCT
trellis is utilized in the decode process, as opposed to the 
DCT trellis which operates in the encode process. The 
butterfly processors 364, 368, 372 and 376 operate in the 
same manner as described with respect to FIG. 4, taking 
advantage of efficiencies in parallel processing. Both in the 
encode and decode process, a significant advantage of an 
embodiment is the reuse of the same hardware for each stage 
of the trellis. Accordingly, the hardware used for the
computations illustrated in column 504 is the same as the
hardware used for computations of columns 508, 512, 516
and 520. Similarly, the hardware used for the computations
illustrated in column 404 is the same as the hardware used
for computations of columns 408, 412, 416 and 420.

Once the final results representing the end of the trellis in 
FIG. 4 are computed, the data is transferred from the 
intermediate data register 340 to the holding register 344. 
The holding register 344 and output data register 352 are
controlled by control sequencer 324. If data is column data,
the data is transferred to the write multiplexer 348 and stored
back into the transpose RAM 304. Again, the encoded data
is resequenced to reflect the original sequence of the
encoded data. If the data is row data, all computations are
therefore completed, and the data is transferred from the 
holding register 344 to the output data register 352. 

FIG. 6 illustrates an example of a single butterfly proces- 
sor with one or more input and output multiplexers 600. In 
an embodiment, data output from one or more intermediate
data registers 340 (see FIG. 3) is coupled to an input portal
of input multiplexer 604. In an embodiment, the data output
from each of the intermediate data registers 340 is input into
the butterfly processor to a first multiplexer 608 and a second
multiplexer 612. Data output from the input AR register 332
(see FIG. 3) is also transferred through the input multiplexer
604. Specifically, the outputs of AR registers AR(0) and AR(8)
are coupled to the input of multiplexer 616, and the outputs
of AR(1), AR(8), AR(9) and AR(15) are coupled to the input
of multiplexer 620. Multiplexers 624 and 628 select either
the signal coming from the AR or the BR register as dictated
by the control sequencer 324 (illustrated in FIG. 3).
Accordingly, multiplexer 624 selects either the data from
multiplexer 608 or 616, and multiplexer 628 selects either
the data from multiplexer 620 or multiplexer 612. The
outputs of the multiplexers 624 and 628 are thus coupled to
the input of the individual butterfly processor 632. Butterfly
processor 632 computes a stage of the DCT/IDCT/DQT/
IDQT transform, as described with respect to FIGS. 3, 4 and
5. The two outputs of the butterfly processor 632, outputs
636 and 638, are each coupled to the input of each of the
intermediate data multiplexers 642 and 646. Data is then selected
from the multiplexers 642 and 646 to a bank of intermediate
registers 650. In an embodiment, there are sixteen such
intermediate multiplexers and data registers.

FIG. 7 illustrates a block diagram of a write multiplexer. 
As illustrated in FIG. 3, the even outputs of the intermediate 
data register 340 are input into a multiplexer 704, and the 
odd outputs of the intermediate data register 340 are input 
into a multiplexer 708. The data in each of the intermediate
registers are resequenced by multiplexers 704, 708, 712 and 
716 as controlled by the control sequencer 324 illustrated in 
FIG. 3, and stored in 17-bit registers 720 and 724, respec- 
tively. The resequenced data is then stored in the transpose 
RAM 304.

FIG. 8 illustrates operation of each butterfly processor 
800. In an embodiment, four butterfly processors are imple- 
mented. However, it is contemplated that any number of 
butterfly processors may be implemented, subject to timing 
and size constraints. Data enters the butterfly through inputs
804 and 808. In an embodiment, input 804 sometimes
represents the DC value, and passes through a truncator 812.
The truncator 812 is responsible for the 1/N function, as
described with respect to the two-dimensional DCT equation
infra. The DC value of input 804 is seventeen bits: a single
sign bit plus sixteen integer bits. The truncator 812 truncates
n bits from the DC value input data to create a truncated DC
value 816, where n is four bits if the data being processed is
a 16x16 block, n is three bits if the data being processed is
an 8x8 block, n is two bits if the data being processed is a 4x4
block, and n is one bit if the data being processed is a 2x2
block. If the input is an AC value, the input bypasses truncator
812 and is routed to a first selector 814. First selector 814 then
selects either the truncated DC value 816 or the AC value
from input A 804. In this embodiment, no fractional bits are
used, although it is contemplated that fractional bits may be 
used. 
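
The relationship between block size and the number of truncated bits is n = log2(N), so that dropping n low-order bits implements the 1/N function. A minimal sketch follows. It assumes that the truncated bits are the least significant ones and that values are held as signed integers; the names `TRUNCATE_BITS`, `truncate_dc` and `first_selector` are illustrative only.

```python
TRUNCATE_BITS = {16: 4, 8: 3, 4: 2, 2: 1}   # block size N -> bits dropped

def truncate_dc(dc_value, block_size):
    """Drop the n least significant bits, i.e. divide by N = 2**n."""
    n = TRUNCATE_BITS[block_size]
    return dc_value >> n          # arithmetic shift keeps the sign

def first_selector(value, is_dc, block_size):
    # DC values pass through the truncator; AC values bypass it.
    return truncate_dc(value, block_size) if is_dc else value
```

For example, `truncate_dc(160, 16)` drops four bits, which is the same as dividing 160 by 16.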

The output of first selector 814 is routed to a delay 820
and a second selector 824. When the truncated DC value
816 is routed to delay 820, the truncated DC value may be
held for a clock cycle before being routed to second selector
824. In an embodiment, delay 820 is a register. Selection of
data in second selector 824 is a function of the type of 
mathematical operation that is to be performed on the data. 
A control word 826, preferably routed from the control 
sequencer, triggers second selector 824. As illustrated 
throughout FIG. 8, control word 826 provides control for a 
number of components. Again depending upon the type of 
mathematical operation to be performed, the data then 
passes to an adder 832 or a subtractor 836. A third selector 
828 also receives the delayed output value from the delay 
820, along with input 808. Again, selection of data in third 
selector 828 is a function of the type of mathematical 
operation that is to be performed on the data. 

As the data is either added or subtracted, the data is then 
passed to either a fourth selector 840 or a fifth selector 844 
for output from the butterfly processor 800. Input 804 is also 
passed to fourth selector 840, and input 808 is passed to fifth 
selector 844. In encode mode, the data may also be routed 
to sixth selector 848. In an embodiment, in encode mode, 
data is routed through an encode delay 852 before being 
routed to the sixth selector 848. 

The second input, input 808, passes through the third 
selector 828 and the sixth selector 848. If input 808 is 
selected by sixth selector 848, the data is routed to a 
multiplier 856, where input 808 is multiplied by a scalar 860. 
The multiplication process with scalar 860 scales the data to 
produce a scaled output 864. In an embodiment, the scalar 
860 is selected based on B. G. Lee's algorithm. In an 
embodiment, the scaled output 864 is then routed to a 
formatter 868. The formatter 868 rounds and saturates the
data from a twenty-four bit format (a sign bit, sixteen integer
bits and seven fractional bits) to a seventeen bit format. Thus,
the formatted scaled output 872 is seventeen bits as opposed
to twenty-four bits in length. Treatment of the data in this manner
allows precision to be maintained when making
calculations, while using fewer bits to represent the same data,
which in turn saves hardware space. The formatted scaled
output 872 is routed through a delay 876 to third selector 828 
and fifth selector 844, for further processing. 
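
The rounding and saturation performed by the formatter can be sketched in fixed-point integer arithmetic. The sketch assumes a two's-complement representation and round-half-up behaviour, neither of which is stated in the text; the names `format_scaled`, `MAX_17` and `MIN_17` are hypothetical.

```python
FRACTION_BITS = 7
MAX_17 = (1 << 16) - 1      #  65535, largest 17-bit two's-complement value
MIN_17 = -(1 << 16)         # -65536, smallest 17-bit two's-complement value

def format_scaled(raw):
    """Round a fixed-point value with 7 fractional bits to an integer, then clamp.

    `raw` is the integer whose real value is raw / 2**7.
    """
    # Add half of one integer step, then drop the fractional bits.
    rounded = (raw + (1 << (FRACTION_BITS - 1))) >> FRACTION_BITS
    # Saturate instead of wrapping when rounding pushes past the range.
    return max(MIN_17, min(MAX_17, rounded))
```

A raw value of 192 represents 1.5 and rounds to 2 under these assumptions, while a value just below the top of the range that would round up to 65536 is clamped to 65535.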

FIGS. 9a-9f illustrate various mathematical operations 
capable of being performed by each butterfly processor. FIG. 
9a illustrates a NO operation that may be performed by the 
butterfly processor 900. Given two inputs, input A (902) and 
input B (904), each input is simply passed through to output 
C (906) and output D (908). Accordingly, in a NO operation, 
C=A and D=B. 

FIG. 9b illustrates an accumulate operation performed by 
the butterfly processor 910. Given two inputs, input A (912) 
and input B (914), output C (916) represents the sum of 
A+B. Input A (912) and input B (914) are combined by an 
adder 913. Output D (918) represents a pass through of input 
B (914). Accordingly, in an accumulate operation, C=A+B 
and D=B. 

FIG. 9c illustrates a butterfly DCT operation performed
by the butterfly processor 920. Given two inputs, input A
(922) and input B (924), output C (926) represents the sum
of input A (922) and input B (924), such that C=A+B. Input
922 and input 924 are combined by an adder 923. Output D
(928) represents the difference of input A (922) and input B (924)
multiplied by coefficient CF (930), such that D=CFx(A-B).
Input 924 is subtracted from input 922 by a subtractor
925, and then multiplied by a multiplier 927. Optionally,
pipeline registers 932 and 934 may be used to temporarily
store the intermediate product until the next clock cycle.

FIG. 9d illustrates a butterfly IDCT operation performed 
by the butterfly processor 936. Given two inputs, input A 
(938) and input B (940), the output C (942) represents the 
sum of input A (938) and input B (940) multiplied by a 
coefficient CF (943), such that the output C=A+(BxCF).
Input B (940) is multiplied by coefficient CF (943) by 
multiplier 945, and then added to input A (938) by adder 
947. Similarly, output D (944) represents the difference of 
input A (938) and input B (940) multiplied by a coefficient 
CF (943), such that D=A-(BxCF). Input B (940) is multiplied
by coefficient CF (943) by multiplier 945, and then
subtracted from input A (938) by subtractor 949. Optionally, 
pipeline registers 946 and 948 may store intermediate prod- 
ucts to be computed in the next clock cycle. 

FIG. 9e illustrates an accumulate register operation performed
by the butterfly processor 950. Given two inputs,
input A (952) and input AREG (954), output C (956)
represents the sum of input A and AREG such that C=A+
AREG. As opposed to an input value, AREG may also be a
value stored from a previous clock cycle in a register 951.
Input A (952) is added to AREG (954) by adder 953.

FIG. 9f represents a DQT/IDQT operation performed by 
the butterfly processor 958. Given two inputs, input A (960) 
and input B (962), output C (964) represents the sum of 
inputs A and B, such that C=A+B. Similarly, output D (966) 
represents the difference of inputs A and B, such that
D=A-B. Input A (960) and input B (962) are combined by 
an adder 963. Input B (962) is subtracted from input A (960) 
by a subtractor 965. 
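
The six operations of FIGS. 9a-9f reduce to the following input/output relations. The sketch below restates only the equations given in the text; the function names are illustrative, and pipeline registers, bit widths and timing are not modelled.

```python
def no_op(a, b):
    return a, b                       # FIG. 9a: C = A, D = B

def accumulate(a, b):
    return a + b, b                   # FIG. 9b: C = A + B, D = B

def dct_butterfly(a, b, cf):
    return a + b, cf * (a - b)        # FIG. 9c: C = A + B, D = CF x (A - B)

def idct_butterfly(a, b, cf):
    prod = b * cf                     # B is scaled once and used twice
    return a + prod, a - prod         # FIG. 9d: C = A + (B x CF), D = A - (B x CF)

def accumulate_register(a, areg):
    return a + areg                   # FIG. 9e: C = A + AREG

def dqt_butterfly(a, b):
    return a + b, a - b               # FIG. 9f: C = A + B, D = A - B
```

Written this way, the DCT butterfly scales after the subtraction, whereas the IDCT butterfly scales input B before the addition and subtraction.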

The process of calculating a transform of image data 1000
is illustrated in FIG. 10, and may be implemented in a
structure as described with respect to FIG. 3. The process is
easily configured for frequency domain techniques such as
the DCT, IDCT, DQT and IDQT. A column or row of data
initially resides in a transpose RAM 1004 and is transferred
into a holding register 1008 in the butterfly processor.
Individual data elements of the block of data are selected to
be combined 1012, and a mathematical operation to be
performed on the individual data elements is selected 1016.
Mathematical operations that may be performed are
described with respect to FIGS. 9a-9f, and include no operation
1020, an accumulate 1024, a DCT butterfly 1028, an
IDCT butterfly 1032, an accumulate register 1036 and a
DQT/IDQT butterfly 1040. The results of the mathematical
operation are temporarily stored 1044. A feedback decision
1048 is then made based on whether further mathematical
operations are needed. In an embodiment, the feedback
decision is controlled by the control sequencer, as described
with respect to FIG. 3. If the data is fed back 1052, the data
is fed back to the holding register 1008, and the process is
repeated. If the data is not fed back 1056, the data is
transferred to an output holding register 1060. Another
decision 1064 is made as to whether additional mathematical
operations are needed for the column or row of data. If so
(1068), the column or row of data is transferred to a holder
1072 and then written back into the transpose RAM 1004. If
not (1076), the block of data is transferred to output data
registers 1080.
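
The select, operate, store and feed back cycle of FIG. 10 can be sketched as a loop over a schedule of passes. This is an illustration of the control flow only; the schedule format, the function `process_vector` and the simple `sum_diff` operation are assumptions made for the example and are not the trellis of FIG. 4.

```python
def process_vector(data, schedule):
    """Apply one scheduled operation per pass, feeding results back until done.

    `schedule` is a list of passes; each pass is a list of
    (operation, index_a, index_b) tuples, one per butterfly in that pass.
    """
    holding = list(data)                      # holding register
    for step in schedule:                     # feedback while passes remain
        intermediate = list(holding)          # temporary storage of results
        for operation, ia, ib in step:
            c, d = operation(holding[ia], holding[ib])
            intermediate[ia], intermediate[ib] = c, d
        holding = intermediate                # fed back for the next pass
    return holding                            # no more passes: to output

def sum_diff(a, b):
    return a + b, a - b

schedule = [
    [(sum_diff, 0, 1), (sum_diff, 2, 3)],     # first pass
    [(sum_diff, 0, 2), (sum_diff, 1, 3)],     # second pass on fed-back data
]
result = process_vector([1, 2, 3, 4], schedule)
```

The same loop body serves every pass, which reflects the reuse of one set of hardware for each stage; only the schedule changes.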

As examples, the various illustrative logical blocks,
flowcharts, and steps described in connection with the
embodiments disclosed herein may be implemented or performed
in hardware or software with an application-specific
integrated circuit (ASIC), a programmable logic device,
discrete gate or transistor logic, discrete hardware
components such as registers and FIFOs, a processor
executing a set of firmware instructions, any conventional 
programmable software and a processor, or any combination 
thereof. The processor may advantageously be a 
microprocessor, but in the alternative, the processor may be 
any conventional processor, controller, microcontroller, or 
state machine. The software could reside in RAM memory, 
flash memory, ROM memory, registers, hard disk, a remov- 
able disk, a CD-ROM, a DVD-ROM or any other form of 
storage medium known in the art. 

The previous description of the preferred embodiments is 
provided to enable any person skilled in the art to make or 
use the present invention. Various modifications to these
embodiments will be readily apparent to those skilled in the 
art, and the generic principles defined herein may be applied 
to other embodiments without the use of the inventive 
faculty. Thus, the present invention is not intended to be 
limited to the embodiments shown herein but is to be 
accorded the widest scope consistent with the principles and 
novel features disclosed herein. 

What we claim as our invention is: 

1. An apparatus to determine a transform of a block of 
encoded data, the block of encoded data comprising a 
plurality of data elements, the apparatus comprising: 

an input register configured to receive a predetermined 
quantity of data elements; 

at least one butterfly processor coupled to the input 
register, the butterfly processor configured to perform 
at least one mathematical operation on selected pairs of 
data elements to produce an output of processed data 
elements; 

at least one intermediate register coupled to the butterfly 
processor, the intermediate register configured to tem- 
porarily store the processed data; and 

a feedback loop coupling the intermediate register and the 
butterfly processor, where if enabled, is configured to 
transfer a first portion of processed data elements to the 
appropriate butterfly processor to perform additional 
mathematical operations and, where if disabled, is 
configured to transfer a second portion of processed 
data elements to at least one holding register; 

wherein the holding register is configured to store the 
processed data until all of the first portion data elements 
is processed. 

2. The apparatus set forth in claim 1, further comprising 
at least one input multiplexer coupling the feedback loop 
and the intermediate register, wherein each input multiplexer 
is configured to temporarily select data elements and transfer 
data elements to the appropriate butterfly processor. 

3. The apparatus set forth in claim 1, further comprising 
at least one output multiplexer coupling the butterfly pro- 
cessor and the intermediate register, wherein each output 
multiplexer is configured to temporarily select data elements 
and transfer data elements to the appropriate intermediate 
register. 

4. The apparatus set forth in claim 1, wherein the trans- 
form is selected from the group consisting of: a Discrete 
Cosine Transform (DCT), a Differential Quadtree Transform 
(DQT), an Inverse Discrete Cosine Transform (IDCT) and 
an Inverse Differential Quadtree Transform (IDQT). 
5. The apparatus set forth in claim 1 wherein the block of
encoded data may be represented as row data and column
data, and further comprising a transpose random-access
memory (RAM) coupled to the input register, wherein the
transpose RAM is configured to store the row data while the
column data is being processed, and wherein the transpose
RAM is configured to store the column data while the row
data is being processed.

6. The apparatus set forth in claim 5, wherein the transpose
RAM is configurable to store two blocks of encoded
data.

7. The apparatus set forth in claim 5, further comprising
a write multiplexer coupling the holding register, wherein
the write multiplexer is configured to resequence data elements
to complete a one-dimensional transform.

8. The apparatus set forth in claim 1 wherein the feedback
loop allows for the same components to be reused irrespective
of block size.

9. The apparatus set forth in claim 1 wherein the feedback
loop allows for the same components to be reused irrespective
of the type of transform.

10. The apparatus set forth in claim 1 wherein the
feedback loop allows for the same components to be reused
irrespective of mathematical operation.

11. The apparatus set forth in claim 1, further comprising
a control sequencer coupled to the feedback loop, wherein
the control sequencer is configured to enable or disable the
feedback loop.

12. The apparatus set forth in claim 11, where the control
sequencer provides the butterfly processor with a unique
coefficient multiplier.

13. The apparatus set forth in claim 12, wherein the
unique coefficient multiplier is based on B. G. Lee's algorithm.

14. The apparatus set forth in claim 11, where the control
sequencer enables certain ones of the input registers based
on a predetermined event.

15. The apparatus set forth in claim 11, where the control
sequencer enables certain ones of the butterfly processors
based on predetermined criteria.

16. The apparatus set forth in claim 11, where the control
sequencer enables certain ones of the intermediate registers
based on predetermined criteria.

17. The apparatus set forth in claim 11, where the control
sequencer enables certain ones of the output registers based
on predetermined criteria.

18. The apparatus as set forth in claim 1, wherein the
mathematical operation is from the group consisting of
addition, multiplication, and subtraction.

19. The apparatus as set forth in claim 1, wherein each
butterfly processor performs a portion of a one-dimensional
transform.

20. The apparatus as set forth in claim 1, wherein the
transform of a block of encoded data is computed as a series
of one-dimensional transforms.

21. An apparatus to determine a transform of a block of
encoded data, the block of encoded data capable of being
represented as row data and column data, each row and
column comprising a plurality of data elements, the apparatus
comprising:

a transpose random access memory (RAM) configured to
store the block of encoded data;

at least one input register coupled to the transpose RAM,
the input register configured to receive columns of data
from the transpose RAM;

at least one butterfly processor coupled to the input
register, the butterfly processor configured to perform a
portion of a one-dimensional transform on selected
pairs of data elements from the column data to produce
an output of first order column data;

at least one intermediate register coupled to the butterfly
processor, the intermediate register configured to temporarily
store the first order column data; and

a feedback loop coupling the intermediate register and the
butterfly processor, where if enabled, is configured to
transfer selected data elements of the first order column
data to the butterfly processor to perform additional
portions of one-dimensional transforms and, where if
disabled, is configured to transfer the column data to
the transpose RAM;

wherein the input register is then configured to receive
rows of data from the transpose RAM, the butterfly
processor is configured to perform a portion of a
one-dimensional transform on selected pairs of data elements
from the rows of data to produce an output of
first order row data, the intermediate register configured
to temporarily store the first order row data, the
feedback loop configured to transfer selected data elements
of the first order row data to the butterfly
processor to perform additional portions of
one-dimensional transforms and, where if disabled, is configured
to transfer the row data to an output register.

22. The apparatus as set forth in claim 21 wherein the
feedback loop is disabled upon completing a
one-dimensional transform on the column or row data.

23. The apparatus set forth in claim 21, further comprising
at least one input multiplexer coupling the feedback loop
and the intermediate register, wherein each input multiplexer
is configured to temporarily select data elements and transfer
data elements to the appropriate butterfly processor.

24. The apparatus set forth in claim 21, further comprising
at least one output multiplexer coupling the butterfly processor
and the intermediate register, wherein each output
multiplexer is configured to temporarily select data elements
and transfer data elements to the appropriate intermediate
register.

25. The apparatus set forth in claim 21, wherein the
transform is selected from the group consisting of: a Discrete
Cosine Transform (DCT), a Differential Quadtree
Transform (DQT), an Inverse Discrete Cosine Transform
(IDCT) and an Inverse Differential Quadtree Transform
(IDQT).

26. The apparatus set forth in claim 21, wherein the
transpose RAM is configurable to store two blocks of
encoded data.

27. The apparatus set forth in claim 21, further comprising
a write multiplexer coupling the holding register, wherein
the write multiplexer is configured to resequence data elements
such that the one-dimensional transform is completed.

28. The apparatus set forth in claim 21 wherein the
feedback loop allows for the same components to be reused
irrespective of block size, type of transform or type of
mathematical operation.

29. The apparatus set forth in claim 21, further comprising
a control sequencer coupled to the feedback loop, wherein
the control sequencer is configured to enable or disable the
feedback loop.

30. The apparatus set forth in claim 29, where the control
sequencer provides the butterfly processor with a unique
coefficient multiplier.

31. The apparatus set forth in claim 29, wherein the
unique coefficient multiplier is based on B. G. Lee's algorithm.

32. The apparatus set forth in claim 29, where the control
sequencer enables certain ones of the input registers, butterfly
processors, intermediate registers, or output registers
based on predetermined criteria.

33. The apparatus as set forth in claim 21, wherein the 
mathematical operation is from the group consisting of 
addition, multiplication, and subtraction. 

34. The apparatus as set forth in claim 21, wherein each 
butterfly processor performs a portion of a one-dimensional
transform. 

35. The apparatus as set forth in claim 21, wherein the 
transform of a block of encoded data is computed as a series 
of one-dimensional transforms.

36. An apparatus to perform an N dimensional transform 
as a cascade of N one-dimensional transforms on a block of 
encoded data, the encoded data comprising a plurality of 
data elements, the apparatus comprising: 

a cluster of butterfly processors coupled to the input 
register, each butterfly processor configured to perform 
a portion of a one-dimensional transform on selected 
pairs of data elements to produce an output of partially 
processed data comprising a plurality of partially pro- 
cessed data elements; 

at least one intermediate register coupled to each butterfly 
processor, the intermediate register configured to tem- 
porarily store the partially processed data; and 

a feedback loop coupled to the intermediate register and 
the butterfly processor, where the feedback loop is 
enabled as necessary to route selected pairs of the 
partially processed data elements to the appropriate 
butterfly processor to perform additional portions of 
one-dimensional transforms until a one-dimensional 
transform is completed. 

37. The apparatus set forth in claim 36, wherein the 
transform is selected from the group consisting of: a Dis- 
crete Cosine Transform (DCT), a Differential Quadtree 
Transform (DQT), an Inverse Discrete Cosine Transform 
(IDCT) and an Inverse Differential Quadtree Transform 
(IDQT). 

38. The apparatus set forth in claim 36 wherein the block 
of encoded data may be represented as row data and column 
data, and further comprising a transpose random-access memory 
(RAM) coupled to the input register, wherein the transpose 
RAM is configured to store the row data while the column 
data is being processed, and wherein the transpose RAM is 
configured to store the column data while the row data is 
being processed. 

39. The apparatus set forth in claim 38, wherein the 
transpose RAM is configurable to store two blocks of 
encoded data. 

40. The apparatus set forth in claim 36 wherein the 
feedback loop allows for the same components to be reused 
irrespective of block size, type of transform or type of 
mathematical operation. 
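
The reuse described in claim 40 can be pictured in software. The following is a minimal sketch only, not the claimed hardware: it assumes a Walsh-Hadamard transform as a stand-in (chosen because its butterflies need no coefficients), and the names `butterfly` and `staged_transform` are hypothetical. The same butterfly routine is applied at every stage and for every power-of-two block size, with each stage's output fed back as the next stage's input.

```python
def butterfly(a, b):
    """One two-input butterfly: returns the sum and the difference."""
    return a + b, a - b


def staged_transform(block):
    """Run log2(N) butterfly stages, feeding each stage's output back in.

    Illustrative stand-in (unnormalized Walsh-Hadamard transform); the
    claims themselves do not prescribe this particular transform.
    """
    n = len(block)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    data = list(block)      # plays the role of the intermediate register
    span = 1
    while span < n:         # one trip around the feedback path per stage
        for start in range(0, n, 2 * span):
            for i in range(start, start + span):
                data[i], data[i + span] = butterfly(data[i], data[i + span])
        span *= 2
    return data


if __name__ == "__main__":
    for size in (2, 4, 8):
        print(size, staged_transform(list(range(1, size + 1))))
```

Only the number of trips around the loop changes with block size; the butterfly itself is untouched.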

41. The apparatus set forth in claim 36, further comprising 
a control sequencer coupled to the feedback loop, wherein 
the control sequencer is configured to enable or disable the 
feedback loop. 

42. The apparatus set forth in claim 41, where the control 
sequencer provides the butterfly processor with a unique 
coefficient multiplier. 

43. The apparatus set forth in claim 42, wherein the 
unique coefficient multiplier is based on B. G. Lee's algo- 
rithm. 
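
For orientation, the kind of coefficient multiplier associated with Lee's fast DCT can be sketched in software. This is an illustrative reading, not the claimed circuit: it assumes the usual recursive decomposition of an unscaled DCT-II in which the differences of mirrored inputs are scaled by 1 / (2 cos((2i + 1)π / (2N))). The function names `lee_coefficient`, `lee_dct` and `naive_dct` are hypothetical.

```python
import math


def lee_coefficient(i, n):
    """Multiplier applied to the i-th difference term for an n-point stage."""
    return 1.0 / (2.0 * math.cos((2 * i + 1) * math.pi / (2 * n)))


def lee_dct(x):
    """Unscaled DCT-II of a power-of-two-length sequence, Lee-style recursion."""
    n = len(x)
    if n == 1:
        return [float(x[0])]
    if n == 0 or n % 2:
        raise ValueError("length must be a power of two")
    half = n // 2
    # Butterflies on mirrored pairs: sums feed the even outputs,
    # scaled differences feed the odd outputs.
    sums = [x[i] + x[n - 1 - i] for i in range(half)]
    diffs = [(x[i] - x[n - 1 - i]) * lee_coefficient(i, n) for i in range(half)]
    even = lee_dct(sums)
    odd = lee_dct(diffs)
    out = []
    for k in range(half - 1):
        out.append(even[k])
        out.append(odd[k] + odd[k + 1])
    out.append(even[-1])
    out.append(odd[-1])
    return out


def naive_dct(x):
    """Direct O(N^2) evaluation of the same unscaled DCT-II, for checking."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
            for k in range(n)]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8]
    print(max(abs(a - b) for a, b in zip(lee_dct(sample), naive_dct(sample))))
```

Each multiplier depends on both the position i and the stage size N, which is one way to understand why a sequencer would need to supply a distinct coefficient to each butterfly.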

44. The apparatus set forth in claim 41, where the control 
sequencer enables certain ones of the input registers, but- 
terfly processors, intermediate registers, or output registers 
based on predetermined criteria. 

45. An apparatus to determine the inverse discrete cosine 
transform of a block of encoded data, the block of encoded 
data comprising a plurality of data elements, the apparatus 
comprising: 

an input register configured to receive a predetermined 
quantity of data elements; 

at least one butterfly processor coupled to the input 
register, the butterfly processor configured to perform 
at least one mathematical operation on selected pairs of 
data elements to produce an output of processed data 
elements; 

at least one intermediate register coupled to the butterfly 
processor, the intermediate register configured to tem- 
porarily store the processed data; and 

a feedback loop coupling the intermediate register and the 
butterfly processor, where if enabled, is configured to 
transfer a first portion of processed data elements to the 
appropriate butterfly processor to perform additional 
mathematical operations and, where if disabled, is 
configured to transfer a second portion of processed 
data elements to at least one holding register; 

wherein the holding register is configured to store the 
processed data until all of the first portion data elements 
is processed. 

46. An apparatus to determine a transform of a block of 
encoded data, the block of encoded data capable of being 
represented as row data and column data, each row and 
column comprising a plurality of data elements, the appa- 
ratus comprising: 

a transpose random-access memory (RAM) configured to 
store the block of encoded data; 

at least one input register coupled to the transpose RAM, 
the input register configured to receive columns of data 
from the transpose RAM; 

at least one butterfly processor coupled to the input 
register, the butterfly processor configured to perform a 
first order transform on selected pairs of data elements 
from the column data to produce an output of first order 
column data; 

at least one intermediate register coupled to the butterfly 
processor, the intermediate register configured to tem- 
porarily store the first order column data; 

a feedback loop coupling the intermediate register and the 
butterfly processor, where if enabled, is configured to 
transfer selected data elements of the first order column 
data to the butterfly processor to perform additional 
transforms and, where if disabled, is configured to 
transfer the column data to the transpose RAM; and 

a control sequencer coupled to the feedback loop, wherein 
the control sequencer is configured to enable or disable 
the feedback loop 

wherein the input register is then configured to receive 
rows of data from the transpose RAM, the butterfly 
processor is configured to perform a first order trans- 
form on selected pairs of data elements from the rows 
of data to produce an output of first order row data, the 
intermediate register is configured to temporarily store 
the first order row data, the feedback loop is configured 
to transfer selected data elements of the first order row 
data to the butterfly processor to perform additional 
transforms and, where if disabled, is configured to 
transfer the row data to an output register. 
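
The column-then-row ordering recited in claim 46 corresponds to the familiar separable evaluation of a two-dimensional transform. The sketch below is a software analogy under stated assumptions: a plain unscaled DCT-II stands in for the one-dimensional transform, a list of rows stands in for the transpose RAM, and the names `dct_1d`, `transpose` and `transform_2d` are hypothetical.

```python
import math


def dct_1d(x):
    """Direct unscaled one-dimensional DCT-II (stand-in 1-D transform)."""
    n = len(x)
    return [sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
            for k in range(n)]


def transpose(block):
    """Swap rows and columns of a block stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def transform_2d(block, transform_1d=dct_1d):
    """Column pass, write-back to the store, then row pass."""
    columns = transpose(block)                      # read columns from the store
    columns_done = [transform_1d(c) for c in columns]
    stored = transpose(columns_done)                # write back so rows can be read
    return [transform_1d(row) for row in stored]    # row pass goes to the output


if __name__ == "__main__":
    for row in transform_2d([[1, 2], [3, 4]]):
        print(row)
```

The same one-dimensional routine serves both passes; only the direction in which data is read from the store changes.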

47. A method to determine a transform of a block of 
encoded data, the block of encoded data comprising a 
plurality of data elements, the method comprising: 

(a) receiving a predetermined quantity of data elements; 

(b) performing at least one mathematical operation on 
selected pairs of data elements to produce an output of 
processed data elements; 

(c) making a determination as to whether any of the 
processed data elements require additional mathemati- 
cal operations; 

(d) selecting a first portion of processed data elements that 
require additional mathematical operations; 

(e) selecting a second portion of processed data elements 
that do not require additional mathematical operations; 

(f) performing at least one mathematical operation on 
selected pairs of the first portion of processed data 
elements to produce a second output of processed data 
elements; and 

(g) storing the second portion of processed data elements 
until all of the first portion of data elements is pro- 
cessed. 

48. The method set forth in claim 47, further comprising: 

(h) repeating steps (c), (d), (e), (f) and (g) as necessary. 
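
Steps (a) through (h) can be traced with a toy example. This is a hedged sketch, not the claimed method as implemented: it assumes an unnormalized Haar-style transform, chosen only because after each stage the differences are finished while the sums still need work, which mirrors the split into a first and a second portion. The name `staged_pairs_transform` and the variables `pending` and `holding` are hypothetical.

```python
def staged_pairs_transform(elements):
    """Pairwise sum/difference stages with a holding store for finished values."""
    n = len(elements)
    if n == 0 or n & (n - 1):
        raise ValueError("number of elements must be a power of two")
    pending = list(elements)    # (a) received data elements
    holding = []                # finished elements wait here, as in step (g)
    while len(pending) > 1:     # (h) repeat as necessary
        pairs = [(pending[i], pending[i + 1]) for i in range(0, len(pending), 2)]
        sums = [a + b for a, b in pairs]     # (b)/(f) operations on selected pairs
        diffs = [a - b for a, b in pairs]
        # (c)-(e): in this toy transform the sums need further operations
        # (first portion) and the differences are final (second portion).
        holding = diffs + holding            # (g) store until the rest is done
        pending = sums
    return pending + holding


if __name__ == "__main__":
    print(staged_pairs_transform([1, 2, 3, 4, 5, 6, 7, 8]))
```

The loop ends when no element requires further operations, at which point the held values and the last result are released together.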

49. The method set forth in claim 47, further comprising: 

(i) outputting the block of encoded data when all of the 
data elements of the block of encoded data do not 
require additional mathematical operations. 

50. The method set forth in claim 47, wherein the trans- 
form is selected from the group consisting of: a Discrete 
Cosine Transform (DCT), a Differential Quadtree Transform 
(DQT), an Inverse Discrete Cosine Transform (IDCT) and 
an Inverse Differential Quadtree Transform (IDQT). 

51. The method set forth in claim 47 wherein the block of 
encoded data may be represented as row data and column 
data, and further comprising: 

storing the row data while the column data is being 
processed; and 

storing the column data while the row data is being 
processed. 

52. The method set forth in claim 47, further comprising 
resequencing data elements before the step of storing, such 
that subsequent delivery of data elements is performed in an 
efficient manner. 

53. The method set forth in claim 47, further comprising 
controlling steps (a), (b), (c), (d), (e), (f), (g), and (h) based 
upon predetermined criteria. 

54. The method set forth in claim 53, further comprising 
providing a unique coefficient multiplier to certain data 
elements based upon predetermined criteria. 

55. The apparatus set forth in claim 54, wherein the 
unique coefficient multiplier is based on B. G. Lee's algo- 
rithm. 

56. The method set forth in claim 47, wherein the math- 
ematical operation is from the group consisting of addition, 
multiplication, and subtraction. 

57. The method as set forth in claim 47, wherein each 
butterfly processor performs a portion of a one-dimensional 
transform. 

58. The method as set forth in claim 47, wherein the 
transform of a block of encoded data is computed as a series 
of one-dimensional transforms. 

59. A computer readable medium containing instructions 
for controlling a computer system to perform a method, 
the method comprising: 

(a) receiving a predetermined quantity of data elements; 

(b) performing at least one mathematical operation on 
selected pairs of data elements to produce an output of 
processed data elements; 

(c) making a determination as to whether any of the 
processed data elements require additional mathemati- 
cal operations; 

(d) selecting a first portion of processed data elements that 
require additional mathematical operations; 

(e) selecting a second portion of processed data elements 
that do not require additional mathematical operations; 

(f) performing at least one mathematical operation on 
selected pairs of the first portion of processed data 
elements to produce a second output of processed data 
elements; and 

(g) storing the second portion of processed data elements 
until all of the first portion of data elements is pro- 
cessed. 

60. An apparatus to determine a transform of a block of 
encoded data, the block of encoded data comprising a 
plurality of data elements, the apparatus comprising: 

(a) means for receiving a predetermined quantity of data 
elements; 

(b) means for performing at least one mathematical opera- 
tion on selected pairs of data elements to produce an 
output of processed data elements; 

(c) means for making a determination as to whether any 
of the processed data elements require additional math- 
ematical operations; 

(d) means for selecting a first portion of processed data 
elements that require additional mathematical opera- 
tions; 

(e) means for selecting a second portion of processed data 
elements that do not require additional mathematical 
operations; 

(f) means for performing at least one mathematical opera- 
tion on selected pairs of the first portion of processed 
data elements to produce a second output of processed 
data elements; and 

(g) means for storing the second portion of processed data 
elements until all of the first portion of data elements is 
processed. 

61. The apparatus set forth in claim 47, further compris- 
ing: 

(h) means for repeating steps (c), (d), (e), (f) and (g) as 
necessary. 

62. The apparatus set forth in claim 47, further compris- 
ing: 

(i) means for outputting the block of encoded data when 
all of the data elements of the block of encoded data do 
not require additional mathematical operations. 

63. The apparatus set forth in claim 47, wherein the 
transform is selected from the group consisting of: a Dis- 
crete Cosine Transform (DCT), a Differential Quadtree 
Transform (DQT), an Inverse Discrete Cosine Transform 
(IDCT) and an Inverse Differential Quadtree Transform 
(IDQT). 

64. The apparatus set forth in claim 47 wherein the block 
of encoded data may be represented as row data and column 
data, and further comprising: 

means for storing the row data while the column data is 
being processed; and 

means for storing the column data while the row data is 
being processed. 

65. The apparatus set forth in claim 47, further comprising 
means for resequencing data elements before the step of 
storing, such that subsequent delivery of data elements is 
performed in an efficient manner. 

66. The apparatus set forth in claim 47, further comprising 
means for controlling elements (a), (b), (c), (d), (e), (f), (g), 
and (h) based upon predetermined criteria. 

67. The apparatus set forth in claim 66, further comprising 
providing a unique coefficient multiplier to certain data 
elements based upon predetermined criteria. 

68. The apparatus set forth in claim 67, wherein the 
unique coefficient multiplier is based on B. G. Lee's algo- 
rithm. 

69. The apparatus set forth in claim 60, wherein the 
mathematical operation is from the group consisting of 
addition, multiplication, and subtraction. 

70. The apparatus as set forth in claim 60, wherein each 
butterfly processor performs a portion of a one-dimensional 
transform. 

71. An apparatus to determine a transform of encoded 
data, the encoded data comprising a plurality of data ele- 
ments in the pixel domain, the apparatus comprising: 

a block size assigner configured to receive the plurality of 
data elements and group the elements into a plurality of 
groups of data elements in the pixel domain; 
a DCT/DQT transformer configured to transform the data 
elements from the pixel domain to the frequency 
domain, the transformer further comprising: 
an input register configured to receive a predetermined 
quantity of data elements of the group; 
at least one butterfly processor coupled to the input 
register, the butterfly processor configured to per- 
form at least one mathematical operation on selected 
pairs of data elements to produce an output of 
processed data elements; 
at least one intermediate register coupled to the butter- 
fly processor, the intermediate register configured to 
temporarily store the processed data; and 
a feedback loop coupling the intermediate register and 
the butterfly processor, where if enabled, is config- 
ured to transfer a first portion of processed data 
elements to the appropriate butterfly processor to 
perform additional mathematical operations and, 
where if disabled, is configured to transfer a second 
portion of processed data elements to at least one 
holding register; 
wherein the holding register is configured to store the 
processed data until all of the first portion data 
elements is processed; 
a quantizer configured to quantize the frequency domain 
elements to emphasize those elements that are more 
sensitive to the human visual system, and de-emphasize 
those elements that are less sensitive to the human 
visual system; 

a serializer configured to produce a serialized stream of 
frequency domain elements; and 

a variable length coder configured to determine succes- 
sive frequency domain elements and non-successive 
frequency domain elements. 

72. The apparatus set forth in claim 71, further comprising 
at least one input multiplexer coupling the feedback loop 
and the intermediate register, wherein each input multiplexer 
is configured to temporarily select data elements and transfer 
data elements to the appropriate butterfly processor. 

73. The apparatus set forth in claim 71, further comprising 
at least one output multiplexer coupling the butterfly pro- 
cessor and the intermediate register, wherein each output 
multiplexer is configured to temporarily select data elements 
and transfer data elements to the appropriate intermediate 
register. 

74. The apparatus set forth in claim 71 wherein the block 
of encoded data may be represented as row data and column 
data, and further comprising a transpose random-access 
memory (RAM) coupled to the input register, wherein the 
transpose RAM is configured to store the row data while the 
column data is being processed, and wherein the transpose 
RAM is configured to store the column data while the row 
data is being processed. 

75. The apparatus set forth in claim 74, wherein the 
transpose RAM is configurable to store two blocks of 
encoded data. 

76. The apparatus set forth in claim 74, further comprising 
a write multiplexer coupling the holding register, wherein 
the write multiplexer is configured to resequence data elements 
to complete a one-dimensional transform. 

77. The apparatus set forth in claim 71 wherein the 
feedback loop allows for the same components to be reused 
irrespective of block size. 

78. The apparatus set forth in claim 71, further comprising 
a control sequencer coupled to the feedback loop, wherein 
the control sequencer is configured to enable or disable the 
feedback loop. 

79. The apparatus set forth in claim 78, where the control 
sequencer provides the butterfly processor with a unique 
coefficient multiplier. 

80. The apparatus set forth in claim 78, where the control 
sequencer enables certain ones of the input registers based 
on a predetermined event. 

81. The apparatus set forth in claim 78, where the control 
sequencer enables certain ones of the butterfly processors 
based on predetermined criteria. 

82. The apparatus set forth in claim 78, where the control 
sequencer enables certain ones of the intermediate registers 
based on predetermined criteria. 

83. The apparatus set forth in claim 78, where the control 
sequencer enables certain ones of the output registers based 
on predetermined criteria. 

84. The apparatus as set forth in claim 71, wherein the 
mathematical operation is from the group consisting of 
addition, multiplication, and subtraction. 

85. The apparatus as set forth in claim 71, wherein each 
butterfly processor performs a portion of a one-dimensional 
transform. 

86. A method of transforming encoded data from the pixel 
domain to the frequency domain, the encoded data compris- 
ing a plurality of data elements, the method comprising: 

(a) grouping the plurality of data elements in the pixel 
domain into a plurality of blocks, each block compris- 
ing a plurality of data elements in the pixel domain; 

(b) performing at least one mathematical operation on 
selected pairs of data elements to produce an output of 
processed data elements; 

(c) making a determination as to whether any of the 
processed data elements require additional mathemati- 
cal operations; 

(d) selecting a first portion of processed data elements that 
require additional mathematical operations; 

(e) selecting a second portion of processed data elements 
that do not require additional mathematical operations; 

(f) performing at least one mathematical operation on 
selected pairs of the first portion of processed data 
elements to produce a second output of processed data 
elements; 

(g) storing the second portion of processed data elements 
until all of the first portion of data elements is pro- 
cessed; 

(h) repeating steps (c), (d), (e), (f) and (g), as necessary, 
until all of the data elements do not require additional 
mathematical operations and are converted to fre- 
quency domain elements; 

(i) quantizing the frequency domain data elements to 
emphasize those elements that are more sensitive to the 
human visual system and de-emphasize those elements 
that are less sensitive to the human visual system; 

(j) serializing the quantized frequency domain data ele- 
ments to produce a serialized stream of frequency 
domain elements; and 

(k) coding the serialized frequency domain elements to 
determine successive frequency domain elements and 
non-successive frequency domain elements. 
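
Steps (i) through (k) can be illustrated with a small software sketch. Several assumptions are made that the claim does not state: the quantization step sizes are invented for the example, serialization is taken to be a zigzag scan, and the coding step is read as zero-run counting. The names `quantize`, `zigzag` and `run_length` are hypothetical.

```python
def quantize(block, steps):
    """Divide each coefficient by its step size and round to an integer.

    Smaller steps keep more detail, so they would be assigned to the
    coefficients to which the eye is assumed to be more sensitive.
    """
    return [[int(round(v / q)) for v, q in zip(row, step_row)]
            for row, step_row in zip(block, steps)]


def zigzag(block):
    """Serialize a square block along anti-diagonals, alternating direction."""
    n = len(block)
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [block[r][c] for r, c in order]


def run_length(stream):
    """Return (zeros_before, value) pairs plus the count of trailing zeros."""
    pairs, zeros = [], 0
    for value in stream:
        if value == 0:
            zeros += 1
        else:
            pairs.append((zeros, value))
            zeros = 0
    return pairs, zeros


if __name__ == "__main__":
    coefficients = [[160, 24, 0], [18, 0, 0], [0, 5, 0]]
    steps = [[8, 12, 20], [12, 20, 30], [20, 30, 40]]   # invented values
    print(run_length(zigzag(quantize(coefficients, steps))))
```

A real coder would go on to assign variable-length codewords to the resulting pairs; that stage is omitted here.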

87. The method set forth in claim 86 wherein the block of 
encoded data may be represented as row data and column 
data, and further comprising: 

storing the row data while the column data is being 
processed; and 

storing the column data while the row data is being 
processed. 

88. The method set forth in claim 86, further comprising 
controlling steps (a), (b), (c), (d), (e), (f), (g), and (h) based 
upon required control signals. 

89. The method set forth in claim 88, further comprising 
providing a unique coefficient multiplier to certain data 
elements based upon predetermined criteria. 

90. The method as set forth in claim 86, wherein each 
butterfly processor performs a portion of a one-dimensional 
transform. 

91. The method as set forth in claim 86, wherein the 
transform of a block of encoded data is computed as a series 
of one-dimensional transforms. 



